CN113191926B

CN113191926B - Method and system for identifying grain and oil crop supply chain hazard based on deep integrated learning network

Info

Publication number: CN113191926B
Application number: CN202110386309.9A
Authority: CN
Inventors: 孔建磊; 王小艺; 金学波; 苏婷立; 张家辉; 王珍妮
Original assignee: Beijing Technology and Business University
Current assignee: Beijing Technology and Business University
Priority date: 2021-04-12
Filing date: 2021-04-12
Publication date: 2024-04-26
Anticipated expiration: 2041-04-12
Also published as: CN113191926A

Abstract

The invention provides a method for identifying hazards in the grain and oil crop supply chain based on a deep integrated learning network, comprising the following steps: acquiring parameter characteristic information of grain and oil food in each link of the supply chain to generate a multi-source heterogeneous data set; performing vector coding and standard normalization preprocessing on each attribute feature; performing multi-granularity filtering scanning on the preprocessed multi-source heterogeneous data set to output a scanning result, and performing K-fold cross validation on the scanning result to obtain training data; performing deep learning on the training data with a deep integrated learning network that stacks and extracts multi-dimensional features, and outputting the learning results of a plurality of sub-models; and fusing the plurality of sub-model learning results through Gaussian mixture. By relying on hazard monitoring data from each link of the grain and oil crop supply chain and applying the deep integrated learning algorithm, the invention improves the accuracy of hazard identification and the reliability of risk assessment for the grain and oil supply chain.

Description

Method and system for identifying grain and oil crop supply chain hazard based on deep integrated learning network

Technical Field

The application relates to the field of food safety, in particular to a method and a system for identifying a grain and oil crop supply chain hazard based on a deep integrated learning network.

Background

When the granaries are full, the world is at peace; grain is a necessity of human life and a commodity vital to the national economy and people's livelihood. The modern grain and oil supply chain in China covers links such as raw material production, storage, processing, logistics and consumption, and its numbers of farmers and enterprises and its processing capacity are the largest in the world. Its safety bears directly on the health of the people and the prosperity of the state. On one side it involves hundreds of millions of farmers, is closely tied to the "three rural issues" (agriculture, rural areas and farmers), and directly affects regional agricultural production and rural economic development. On the other side it involves hundreds of millions of citizens, is closely tied to public dietary safety and health, and directly affects the market prosperity and social stability of the food industry in China. A supply of rice that is non-toxic and nutritionally adequate has become a focus of public attention in China.

However, the safety situation of the grain and oil food supply chain in China remains severe and complex. Grain and oil crop supply chain safety spans multiple links, including planting, production and processing, circulation and storage, and sales and consumption. Hazard risk factors of different categories and degrees exist in every link, and each factor is influenced by food diversity, the multi-source heterogeneity of the data, regional distribution differences, time variability and the like. The first problem is contamination of the planting source by heavy metals and pesticide residues. During planting and harvesting, crops affected by diseases, insect pests or weather may develop disease and mildew, producing aflatoxins, ochratoxins and the like; process safety control in storage, transportation, sales and other links is weak, and various unsafe process factors exist, which significantly affect mycotoxin levels. According to statistics, about 31 million tons of grain in China are contaminated by mycotoxins during processing, storage, transportation and sales, which is about 6.2% of the total annual yield of rice.

In view of the long-standing hidden dangers posed by various hazards in the rice supply chain, constructing a grain and oil crop supply chain hazard identification method that covers production, processing, storage and transportation through to circulation and sales has become an effective means of guaranteeing the safety of the whole supply chain. Based on monitoring of the external information and internal operation information of the supply chain, such a method analyzes the types of hazard factors, the migration and distribution of their content, and the changes in early-warning indexes, and completes accurate and rapid hazard identification and risk classification, so that possible losses are avoided or reduced. However, because the internal and external hazard generation processes in the rice food supply chain are complex and changeable, the information held by the different parties in the supply chain is asymmetric, and real-time monitoring and early-warning means are relatively lacking, how to effectively ensure the safety of rice food from production, processing, storage and transportation to circulation and sales has long been a problem of common concern at home and abroad.

The identification of hazards in the grain and oil crop supply chain involves multiple theories, methods and evaluation indexes, and the severity of the hazards and their influence on social stability vary. Related work has demonstrated the applicability of artificial neural networks and machine learning methods to various aspects of food engineering and food science. Over the past two decades, more and more technologies have been applied to the field of hazard identification. For example, support vector machine methods have been used for risk identification and assessment in rice production; random forest and binary tree theory has been used to predict obsolescence risk and product life cycles while minimizing the maintenance of the hazard identification system; and the artificial neural network (ANN) has been used to grade catering places with potential hazard risks, combined with various machine learning methods to identify food risks and predict changes in their level.

However, existing hazard identification and analysis methods still have difficulty with hazard identification and risk evaluation in the grain and oil food supply chain, and their identification reliability and accuracy still have technical defects: (1) a single-variable identification model is used directly for identification analysis, so the key features obtained by the model are limited, the classification and identification precision is low, and the model cannot be applied to a grain and oil supply chain monitoring platform built from a large number of different sensors or to mining the multi-source heterogeneous data it acquires; (2) the risk level is not analyzed from the perspective of migration and diffusion along the supply chain, so multi-dimensional data mining and comprehensive identification are difficult when multiple hazards are mixed; (3) the inherent characteristics of the multi-source heterogeneous data variables and the relationships between the variables are not exploited, so the mass of food safety monitoring data has a negative influence on hazard identification performance.

As described above, these hazard identification methods suffer from problems of varying degree: they struggle to perform well on multi-dimensional and unbalanced data, and they find it even harder to process, learn from, fuse and form effective decisions on the massive information acquired by widely distributed sensors in the supply chain links. Therefore, an effective and accurate solution is needed in the food safety field of the grain and oil crop supply chain, one that improves the identification accuracy of harmful substances in the supply chain and the reliability of risk level analysis, realizes intelligent risk supervision of the whole grain and oil crop industry chain, and provides scientific and reliable identification and evaluation results for the relevant government supervision departments, enterprises and factories, and consumers at large.

Disclosure of Invention

In order to solve one of the technical problems, the invention provides a method and a system for identifying the hazards of a grain and oil crop supply chain based on a deep integrated learning network.

The first aspect of the embodiment of the invention provides a grain and oil crop supply chain hazard identification method based on a deep integration learning network, which comprises the following steps:

acquiring parameter characteristic information of grain and oil food in each link of a supply chain, and integrating the parameter characteristic information to generate a multi-source heterogeneous data set;

performing vector coding and standard normalization preprocessing on each attribute characteristic in the multi-source heterogeneous data set to obtain a multi-source heterogeneous data set containing standardized data;

Performing multi-granularity filtering scanning on the multi-source heterogeneous data set containing the standardized data to output a scanning result, and performing K-fold cross validation on the scanning result to obtain training data;

taking decision trees, random forests, gradient boosting trees and extremely randomized trees as basic sub-modules, constructing a deep integration learning network with multi-dimensional feature stacking extraction to perform deep learning on the training data, and outputting learning results of a plurality of sub-models;

And fusing the plurality of sub-model learning results through Gaussian mixture to obtain a grain and oil crop supply chain hazard identification model, and identifying the grain and oil crop supply chain hazard according to the grain and oil crop supply chain hazard identification model.

Preferably, the process of vector encoding and standard normalization preprocessing of each attribute feature in the multi-source heterogeneous data set to obtain the multi-source heterogeneous data set containing the normalized data includes:

performing one-hot encoding on each attribute characteristic in the multi-source heterogeneous data set;

performing weight embedding on the one-hot encoded output data through bidirectional vectorization to obtain vector characteristics;

performing linear transformation on the numerical attribute features in the vector features to obtain normalized data;

And carrying out normal distribution standardization processing on the normalized data, and aggregating the numerical attribute characteristics to an approximate normal distribution state with a mean value of 0 and a variance of 1 to obtain a multi-source heterogeneous data set containing the standardized data.

Preferably, the process of performing multi-granularity filtering scanning on the multi-source heterogeneous data set containing standardized data to output a scanning result includes:

sliding sampling is carried out on an input multi-source heterogeneous data set containing standardized data from the beginning to the end by adopting a one-dimensional convolution kernel filter, so as to obtain a sub-sample vector;

each sub-sample is trained by a completely random forest and an ordinary random forest to obtain a probability vector;

And splicing all probability vectors to obtain a multi-source heterogeneous data set containing the characterization vector and outputting the multi-source heterogeneous data set as a scanning result.

Preferably, the process of performing K-fold cross validation on the scan result to obtain training data includes:

Dividing the multi-source heterogeneous data set containing the characterization vector into K sample subsets with equal sizes;

Sequentially traversing the K sample subsets with the same size, wherein the i-th (i = 1, 2, ..., K) traversal takes the i-th sample subset as a verification set and all the remaining sample subsets as training sets to train a model to obtain training data, and obtaining an evaluation result for each sample subset;

The average value of the K evaluation indexes is taken as the final evaluation index.

Preferably, the process of taking decision trees, random forests, gradient boosting trees and extremely randomized trees as basic sub-modules, constructing a deep integrated learning network with multi-dimensional feature stacking extraction to perform deep learning on the training data, and outputting the learning results of a plurality of sub-models includes:

performing preliminary training on the training data through a decision tree learner, a random forest learner, a gradient boosting tree learner, an extremely randomized tree learner and a risk assessment model learner, so that first-layer data corresponding to each learner is obtained;

splicing the first-layer data corresponding to each learner with the training data respectively to obtain the second-layer input data corresponding to each learner;

and stacking the decision tree learner, the random forest learner, the gradient boosting tree learner, the extremely randomized tree learner and the risk assessment model learner, training each on its corresponding second-layer input data, and then outputting a plurality of learning results.

A second aspect of an embodiment of the present invention provides a grain and oil crop supply chain hazard identification system based on a deep ensemble learning network, the system comprising a processor configured with processor-executable operating instructions to perform operations comprising:

Preferably, the processor is configured with processor-executable operating instructions to perform the following operations:

and splicing all probability vectors to obtain a characterization vector and outputting the characterization vector as a scanning result.

The beneficial effects of the invention are as follows: by relying on the hazard monitoring data of all links of the grain and oil crop supply chain and applying the deep integration learning algorithm, the method mines the interrelationships among the links, regions and time-varying factors of the grain and oil crop supply chain, reduces the influence of unbalanced multi-source heterogeneous data distribution, incomplete multi-dimensional characteristic information and the like on the identification system and method, and improves the accuracy of hazard identification and the reliability of grain and oil supply chain risk assessment.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1 is a flowchart of a method for identifying a grain and oil crop supply chain hazard based on a deep ensemble learning network according to embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of a multi-granularity filtering scan of a multi-source heterogeneous data set containing standardized data according to embodiment 1 of the present invention;

FIG. 3 is a schematic diagram of the principle of K-fold cross-validation of scan results according to embodiment 1 of the present invention;

FIG. 4 is a schematic diagram of the deep learning principle of the deep learning network according to embodiment 1 of the present invention;

FIG. 5 is a schematic diagram of the method for fusing multiple sub-model learning results by Gaussian mixture according to embodiment 1 of the present invention;

FIG. 6 is a schematic diagram of a confusion matrix drawn by applying the classification result of the deep ensemble learning model proposed by the embodiment of the present invention in the example;

FIG. 7 is a schematic diagram of the distribution of risk levels in various links of the supply chain in an example;

FIG. 8 is a schematic diagram showing the distribution of risk levels in various hazards of food products in an example;

FIG. 9 is a graphical representation of the distribution ratio of four major contaminants in the production, distribution, sales and purchase links.

Detailed Description

In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the following detailed description of exemplary embodiments of the present application is provided in conjunction with the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application and not exhaustive of all embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.

Example 1

As shown in fig. 1, the embodiment provides a method for identifying a hazard of a grain and oil crop supply chain based on a deep ensemble learning network, which includes:

S101, acquiring parameter characteristic information of grain and oil food in each link of a supply chain, and integrating the parameter characteristic information to generate a multi-source heterogeneous data set.

Specifically, in this embodiment, a large amount of hazard monitoring information is collected from multiple links of the grain and oil crop supply chain through various monitoring sensors and devices, such as temperature and humidity sensors, oxygen concentration sensors, carbon dioxide concentration sensors and various pollutant detection devices. This yields sensor information such as temperature, humidity and oxygen concentration in the production and processing links of grain and oil food, and quality safety data such as water content, mildew and grain aging degree in the storage, transportation, sales and consumption links, from which the hazard categories and contents present in each link are determined. The mass data are transmitted to a cloud server through the Internet of Things and related information transmission technologies for storage, computation, mining and analysis, and the multi-dimensional heterogeneous data set is finally obtained.

The multi-dimensional heterogeneous data set covers multiple links of the grain and oil crop supply chain, from planting, production and processing, and storage and logistics to final sales and consumption in markets, supermarkets and other catering places. The regional sources of the grain and oil products cover the main producing areas and major consuming provinces of Chinese grain food, which are also densely populated areas. The foods consist of rice and its processed products, wheat flour and its processed products, corn raw grain and oil products, soybean raw material and oil paper skin, and other processed food products. Each collected data sample is composed of a number of factors, including: production information such as product name, nominal production enterprise and sampled enterprise information, sampling link and place, production and spot inspection dates, region type, food classification, spot inspection items and results, standard detection method and maximum allowable limit; environmental data such as temperature, humidity and oxygen concentration in places such as processing factories, warehouses and transport vehicles in each supply chain link, together with quality data such as moisture content, mildew degree and grain aging degree; and, for each sampled object, the category, content and risk grade of hazardous substances such as heavy metals, microorganisms, mycotoxins and pesticide residues. Starting from the objective characteristics of grain food hazardous substances, qualitative indexes such as social attention, hazard degree and supervision accessibility are covered, quantitative indexes such as total annual yield, grain production and consumption price are also considered, and the data features are expanded to obtain the detailed data composition shown in table 1.

TABLE 1

S102, carrying out vector coding and standard normalization preprocessing on each attribute characteristic in the multi-source heterogeneous data set to obtain the multi-source heterogeneous data set containing standardized data.

Specifically, the attribute features of the multi-source heterogeneous data set differ in category (English text, Chinese characters, placeholders and the like) and in structure type (numerical, logical, character, floating point and the like), so they cannot be stored and analyzed directly if they are input as-is into a cloud platform server or a computer. In addition, the attribute feature values do not conform to an approximately normal distribution, and feeding them directly into the hazard identification model and method can lead to poor training and testing results. Moreover, abnormal samples and noise samples in the data set can interfere with correct fitting of the model; for example, the values of the same feature in different samples can be far apart, and this phenomenon is particularly apparent in the actual grain and oil crop supply chain. Therefore, vector encoding of the different category attributes and data types must be performed first, to convert semi-structured and unstructured data into structured numeric values.

In this embodiment, bidirectional vectorization (DictVectorizer) and one-hot encoding (OneHotEncoder) are adopted to vector-encode each attribute feature in the multi-source heterogeneous data set, so that the Euclidean distances between different attribute values are approximately the same. Suppose the input is m × l original data in which the l attribute columns contain n distinct category values; the category values are binarized by 0/1 one-hot encoding, converting the original data format to m × n, so that each attribute column is expanded in the same way without imposing any magnitude relation on the original features. Taking the production province attribute column as an example, with the five categories Henan, Shandong, Heilongjiang, Jiangsu and Anhui, the vector coding operation yields Henan 10000, Shandong 01000, Heilongjiang 00100, Jiangsu 00010 and Anhui 00001; the unstructured region attribute is thus encoded as a structured numeric type with no notion of numeric magnitude.
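The province example above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the embodiment's implementation: the helper names `one_hot`, `encode_column` and `euclidean` are hypothetical, and the category order is simply the order in which the text lists the provinces.

```python
import math

# Category order follows the province example in the text.
PROVINCES = ["Henan", "Shandong", "Heilongjiang", "Jiangsu", "Anhui"]


def one_hot(value, categories=PROVINCES):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def encode_column(values, categories=PROVINCES):
    """One-hot encode every entry of an attribute column."""
    return [one_hot(v, categories) for v in values]


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


codes = encode_column(["Heilongjiang", "Henan", "Anhui"])
```

Any two distinct codes differ in exactly two positions, so their Euclidean distance is always the square root of 2, which is the sense in which the encoding introduces no ordering between provinces.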

However, the production province attribute in the multi-source heterogeneous data set contains many regions. As the dimension grows rapidly, a large number of redundant sparse matrices are generated, which increases the computational load and introduces many useless operations without improving the performance of the identification model, and makes the inherent relations between different dimensions harder to capture. Therefore, in this embodiment, after one-hot encoding each attribute feature in the multi-source heterogeneous data set, weight embedding is performed by bidirectional vectorization, so as to achieve vector encoding with low-dimensional projection and low computational cost. The process is as follows:

First, a given attribute feature $x_i$ in the multi-source heterogeneous data set is one-hot encoded, as shown in the following formula:

$$x_i \mapsto \delta_{x_i \alpha}$$

where $\delta_{x_i \alpha}$ is the Kronecker delta and $\alpha$ and $x_i$ are its two inputs: the element is 1 when $\alpha = x_i$ and 0 otherwise. If $m_i$ is the number of possible values of attribute feature $x_i$, then $\delta_{x_i \alpha}$ is a vector of length $m_i$.

Then, the attribute feature is passed through bidirectional vector embedding to obtain the feature expression shown in the following formula:

$$x_i \equiv \sum_{\alpha} \omega_{\alpha\beta}\,\delta_{x_i \alpha} = \omega_{x_i \beta}$$

where $\omega_{\alpha\beta}$ is the weight connecting the one-hot encoding layer and the bidirectional vector embedding layer, and $\beta$ is the index of the embedding layer. Taking the feature vector "Heilongjiang 00100" in the province attribute column as an example, after vector coding the vector is embedded into a low-dimensional matrix space, and the embedded vector feature is obtained through linear projection mapping with the n × m-dimensional ω weight matrix. The specific process is as follows:

After all attribute features are represented by bidirectional vector embedding, the inputs of all continuous variables are mapped and concatenated. Compared with one-hot encoding, this reduces the feature dimension, avoids heavy consumption of computing resources and memory, and facilitates subsequent data processing and recognition model training.
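The embedding step can be read as multiplying the one-hot vector by a weight matrix, which amounts to selecting one row of that matrix. The sketch below illustrates this equivalence for the "Heilongjiang 00100" example. The 5 × 2 weight values are made up for illustration (in practice they would be learned), and the function names are hypothetical.

```python
PROVINCES = ["Henan", "Shandong", "Heilongjiang", "Jiangsu", "Anhui"]

# Hypothetical weight matrix: 5 categories projected to 2 dimensions.
# These numbers are illustrative only and do not come from the source.
WEIGHTS = [
    [0.1, 0.9],
    [0.4, 0.2],
    [0.7, 0.3],
    [0.5, 0.5],
    [0.8, 0.6],
]


def one_hot(index, size):
    """0/1 vector of length `size` with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]


def embed_by_product(one_hot_vec, weights):
    """Linear projection: sum over alpha of delta[alpha] * w[alpha][beta]."""
    n_out = len(weights[0])
    return [
        sum(one_hot_vec[a] * weights[a][b] for a in range(len(weights)))
        for b in range(n_out)
    ]


def embed_by_lookup(index, weights):
    """Equivalent shortcut: pick the row of the weight matrix directly."""
    return list(weights[index])


idx = PROVINCES.index("Heilongjiang")
embedded = embed_by_product(one_hot(idx, len(PROVINCES)), WEIGHTS)
```

Here the five-dimensional sparse code is replaced by a dense two-dimensional vector, which is the dimension reduction the paragraph refers to.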

As can be seen from the multi-dimensional heterogeneous data set, the data set has multi-dimensional attribute features whose meanings and numerical ranges all differ; the values of the same attribute feature can be far apart across samples, and abnormally sized values and discrete distribution patterns can mislead model training. In order to make the different attribute indexes comparable, help the model better capture the meaning of the data, and eliminate the influence of attribute dimension, this embodiment performs linear normalization and standardization on the numerical attribute features of the vector-encoded multi-dimensional heterogeneous data set. A linear transformation is applied to the original data in the multi-dimensional heterogeneous data set to map the result to the range [0, 1], realizing proportional scaling of the original data.

The normalization process is shown in the following formula:

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

where $X$ is the original data in the multi-dimensional heterogeneous data set, $X_{norm}$ is the normalized data, and $X_{max}$ and $X_{min}$ are the maximum and minimum values of the original data in the multi-dimensional heterogeneous data set, respectively. The normalized data provide a basic data set for the subsequent risk scoring and early warning model.
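As a minimal sketch of this linear mapping to [0, 1], the function below applies min-max scaling to one numerical attribute column. The moisture readings are hypothetical sample values, and mapping a constant column to 0.0 is an assumption made here to avoid division by zero; the source does not say how that case is handled.

```python
def min_max_normalize(values):
    """Scale a numerical column linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical moisture-content readings (percent) for one attribute column.
moisture = [12.0, 13.5, 15.0, 18.0]
moisture_norm = min_max_normalize(moisture)
```

The ordering and relative spacing of the readings are preserved; only the scale changes.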

Then, normal distribution standardization is applied to the normalized attribute features, aggregating them to an approximately normal distribution with a mean of 0 and a variance of 1. The standardization formula is as follows:

$$X_{std} = \frac{X_{norm} - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the attribute feature. In this way, attribute feature values that originally differ widely are finally confined to the same range.
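The standardization step can be sketched as a z-score transform. The sketch assumes the population variance (dividing by the number of samples) and maps a constant column to zeros; neither choice is stated in the source, and the sample values are hypothetical.

```python
import math


def standardize(values):
    """Shift and scale a column to mean 0 and variance 1 (z-score)."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: population variance (divide by n, not n - 1).
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    if std == 0:
        return [0.0] * n
    return [(v - mean) / std for v in values]


# Hypothetical attribute column with mean 5 and standard deviation 2.
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
column_std = standardize(column)
```

After the transform, a value of 1.0 means "one standard deviation above the column mean", regardless of the original unit of the attribute.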

S103, carrying out multi-granularity filtering scanning on the multi-source heterogeneous data set containing the standardized data to output a scanning result, and carrying out K-fold cross validation on the scanning result to obtain training data.

Specifically, as shown in fig. 2, this embodiment proposes a convolution kernel sliding filter with an adjustable window size, which performs multi-granularity filtering scanning on the input multi-source heterogeneous data set containing standardized data. First, an M-dimensional multi-source heterogeneous data sample containing standardized data is input, and a one-dimensional convolution kernel filter of length k and step length s is designed (the window size can be adjusted according to actual requirements). The filter slides over the input from beginning to end, yielding $M' = (M - k)/s + 1$ sub-sample vectors with k-dimensional features. Each sub-sample is then trained to obtain a probability vector of length P, and after splicing, a characterization vector of length M' × P is generated as output.
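The sliding sampling can be sketched as follows. The window count used here, floor((M - k)/s) + 1, is the usual count for an unpadded sliding window and is an assumption about the intended formula. `placeholder_forest` is a hypothetical stand-in that returns a uniform probability vector where the embodiment would use trained forests.

```python
def sliding_windows(features, k, s=1):
    """Slide a window of length k with step s over a feature vector."""
    if not 1 <= k <= len(features):
        raise ValueError("window length k must be between 1 and len(features)")
    return [features[i:i + k] for i in range(0, len(features) - k + 1, s)]


def placeholder_forest(window, n_classes):
    """Stand-in for a trained forest: returns a length-P probability vector.

    Assumption: uniform probabilities, used only to show the output shape.
    """
    return [1.0 / n_classes] * n_classes


def scan(features, k, s, n_classes):
    """Concatenate the per-window probability vectors into one vector."""
    representation = []
    for window in sliding_windows(features, k, s):
        representation.extend(placeholder_forest(window, n_classes))
    return representation


sample = [0.1 * i for i in range(10)]      # M = 10 standardized features
windows = sliding_windows(sample, k=3, s=1)
representation = scan(sample, k=3, s=1, n_classes=4)
```

With M = 10, k = 3 and s = 1 there are 8 windows, so a probability vector of length P = 4 per window gives a characterization vector of length 32.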

Basic multi-granularity scanning samples every attribute feature repeatedly except the first and last ones, which are sampled only once. In practical applications the first and last attribute features can be very important, and sampling these two dimensions less often than the others can affect the final result of the classifier. Therefore, before sliding filtering, a padding operation is performed on the attribute features of the input multi-source heterogeneous data set containing standardized data: 0-value features of length k-1 are appended at each of the two ends, realizing edge expansion of the input sample. After the padding optimization, the number of k-dimensional sub-sample vectors becomes $M' = (M + 2(k-1) - k)/s + 1$. Each forest can therefore generate a characterization vector of length M' × P (that is, the probability vectors produced by the random forest, spliced together). The training results of the forest sub-models are spliced and output as the multi-granularity filtering scan result, which effectively avoids overfitting caused by excessive attention to sensitive information, while also avoiding the poor identification results that arise when potential information in the samples is ignored.
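The effect of the padding can be checked by counting how many windows cover each original feature. The sketch below is illustrative only; `pad_zeros` and `coverage` are hypothetical helper names, and a step of 1 is assumed in the example.

```python
def pad_zeros(features, k):
    """Append k - 1 zero-valued features at each end of the vector."""
    pad = [0.0] * (k - 1)
    return pad + list(features) + pad


def coverage(n_features, k, s=1, padded=False):
    """Count how many sliding windows contain each original feature."""
    offset = k - 1 if padded else 0
    total = n_features + 2 * offset
    counts = [0] * n_features
    for start in range(0, total - k + 1, s):
        for pos in range(start, start + k):
            j = pos - offset          # index in the original vector
            if 0 <= j < n_features:
                counts[j] += 1
    return counts


without_padding = coverage(6, k=3)
with_padding = coverage(6, k=3, padded=True)
```

Without padding the first and last features are each seen by a single window, whereas with k - 1 zeros on both sides every feature is seen by k windows when the step is 1.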

A common model training process randomly selects d samples from the data set D as a training set and uses the rest as a test set. In this case, however, the data are used only once and are not fully utilized, and when the data are unbalanced or the amount of data is small, effective iterative training of the model is difficult to complete. This embodiment uses K-fold cross validation for data partitioning, which is similar to a training/test split but is applied over a larger number of subsets. As shown in fig. 3, the specific working principle is as follows: first, the multi-source heterogeneous data set containing the characterization vectors is divided into K sample subsets of equal size; then the subsets are traversed in turn, where the i-th (i = 1, 2, ..., K) traversal takes the i-th subset as the validation set and all other subsets as the training set to train the model, giving the respective evaluation result $E_i$; finally, the average of the K evaluation indexes is taken as the final evaluation index, $\bar{E} = \frac{1}{K}\sum_{i=1}^{K} E_i$. K-fold cross validation can avoid model overfitting caused by data set bias and improper data set division.
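A minimal sketch of the K-fold procedure is given below. It splits sample indices into contiguous folds without shuffling and gives any remainder to the first folds; both are simplifying assumptions, since the text only requires subsets of equal size. The `evaluate` callback is hypothetical and stands for "train on the training indices, score on the validation indices".

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n_samples, k, evaluate):
    """Average the k evaluation results E_i, one per held-out fold."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, val_idx))
    return sum(scores) / k


def validation_fraction(train_idx, val_idx):
    """Toy evaluation: fraction of samples that were held out."""
    return len(val_idx) / (len(train_idx) + len(val_idx))


folds = k_fold_indices(10, 5)
mean_score = cross_validate(10, 5, validation_fraction)
```

Every sample is used for validation exactly once and for training K - 1 times, which is how the procedure makes fuller use of a small or unbalanced data set.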

S104, taking the decision tree, the random forest, the gradient boosting tree and the extremely randomized tree as basic sub-modules, constructing a deep integration learning network with multi-dimensional feature stacking extraction to perform deep learning on the training data, and then outputting learning results of a plurality of sub-models.

Specifically, the core idea of the integrated stacking algorithm is to design a fusion learner that combines the results of multiple basic learners to improve performance. The approach constructs m sub-training sets by random sampling, independently trains T weak learners on the sub-training sets, and then applies a fusion strategy to combine the T weak learners into a final strong learner. In practical applications, however, the data often do not follow a normal distribution and contain latent internal associations and mapping rules; when a traditional machine learning method with a shallow structure is applied to such data, it is difficult to mine the effective key information, and the hazard identification performance degrades. Therefore, this embodiment draws on the deep learning concept and uses the decision tree (Decision Tree, DT), random forest (Random Forest, RF), gradient boosting decision tree (Gradient Boosting Decision, GBD) and extremely randomized tree (Extremely randomized Tree, ET) as base sub-modules to construct a deep integrated learning network algorithm with multi-dimensional feature stacking extraction, whose basic structure is shown in fig. 4.
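The general ensemble idea described here (random sub-training sets, independently trained weak learners, and a fusion strategy) can be sketched with a toy example. The sketch makes several assumptions that are not in the source: the weak learner is a one-dimensional threshold classifier standing in for the tree models, the sub-training sets are bootstrap samples, and the fusion strategy is a plain majority vote. The contaminant values and labels are hypothetical.

```python
import random
from collections import Counter


def train_stump(samples):
    """Fit a 1-D threshold classifier (a very weak learner) on (x, label) pairs."""
    best = None
    for thr in sorted({x for x, _ in samples}):
        left = [y for x, y in samples if x <= thr]
        right = [y for x, y in samples if x > thr]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, thr, left_label, right_label)
    _, thr, left_label, right_label = best
    return lambda x: left_label if x <= thr else right_label


def bagging_fit(samples, n_learners, seed=0):
    """Train each weak learner on its own bootstrap sub-training set."""
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        subset = [rng.choice(samples) for _ in samples]
        learners.append(train_stump(subset))
    return learners


def vote(learners, x):
    """Fusion strategy assumed here: simple majority vote."""
    return Counter(learner(x) for learner in learners).most_common(1)[0][0]


# Hypothetical contaminant contents with a coarse risk label.
data = [(0.1, "low"), (0.2, "low"), (0.3, "low"), (0.4, "low"),
        (0.6, "high"), (0.7, "high"), (0.8, "high"), (0.9, "high")]
ensemble = bagging_fit(data, n_learners=15)
```

A call such as `vote(ensemble, 0.95)` then returns the class preferred by most of the weak learners, which is the combined "strong learner" decision in this simplified setting.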

The deep integrated learning network algorithm consists of two layers of primary learners. The first layer performs preliminary training on the input training data through five learners. The random forest and the gradient boosting tree are typical representatives of the bagging and boosting approaches in ensemble learning, respectively; they have excellent learning ability and strong generalization ability and are widely used in many fields. The extremely randomized tree, a variant of the random forest, performs well in practical applications, and the decision tree is adopted because its theory is mature and it is insensitive to outliers. The output of the first layer is then spliced with the training data input to the first layer to form the training data of the second layer; the second layer stacks the five learners of the first layer to train on this data, summarizing and correcting the results of the learning algorithms of the previous layer. Finally, a fusion method is selected to obtain the final output. Thus, for the multi-dimensional integrated extraction module, with the training data $D$ as input and the integrated classification h = {I, II, III, IV, V, VI, VII, VIII} as output, the learning process is as follows:

(1) Input the grain training data $D$, whose features are $A = \{A_j\}$, $j = 1, 2, \dots, n$, and train the first-layer learners $T_1 = \{t_{x\text{-}y} \mid t_{1\text{-}1}, t_{1\text{-}2}, t_{1\text{-}3}, t_{1\text{-}4}, t_{1\text{-}5}\}$, where x is the integration layer number and y is the learner type.

(2) $h_{1\text{-}1}(x_i)$ is obtained through the expert risk assessment model $t_{1\text{-}1}$. For the decision tree $t_{1\text{-}2}$, the information gain ratio of each feature is calculated over all examples in $D$ and the feature $A_g$ with the maximum information gain ratio is selected; if that ratio is smaller than a threshold $\varepsilon$, the class $H_k$ with the most examples in $D$ is taken as the class of the node to construct $t_{1\text{-}2}$; otherwise $D$ is divided into several non-empty subsets $D_i$ according to each possible value of $A_g$, and child nodes labelled with the class having the most examples in $D_i$ are added to $t_{1\text{-}2}$, giving $h_{1\text{-}2}(x_i)$. For the random forest $t_{1\text{-}3}$, m sub-data sets $D_i$ are constructed by random sampling; for each $D_i$, n features $A_j$ are randomly selected from $A$ and the optimal split point is found as in decision tree construction; this is repeated to obtain the decision trees $t_{1\text{-}3\text{-}k}$, each sub-tree produces a result, and $h_{1\text{-}3}(x_i)$ is obtained by majority voting. For the extremely randomized tree $t_{1\text{-}4}$, the split attribute of each decision tree $t_{1\text{-}4\text{-}k}$ is selected completely at random, and training gives $h_{1\text{-}4}(x_i)$. The gradient boosting tree $t_{1\text{-}5}$ is initialized as $f_0(x) = \arg\min_c \sum_{i=1}^{N} L(y_i, c)$, where $L$ denotes the loss function; the grain data are then iterated over for $m = 1, 2, \dots, M$ rounds, computing the negative gradient of the loss function as the residual $r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}$, fitting a tree to $\{(x_1, r_{m1}), \dots, (x_N, r_{mN})\}$ to obtain the leaf node regions $R_{mj}$, $j = 1, 2, \dots, J$, of the m-th tree, estimating the value of each leaf node region that minimizes the loss function, $c_{mj} = \arg\min_c \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + c)$, and updating $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})$; the final gradient boosting tree is $\hat{f}(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})$, which gives $h_{1\text{-}5}(x_i)$. The first-layer training result is $H_1 = \{h_{1\text{-}1}(x_i), h_{1\text{-}2}(x_i), h_{1\text{-}3}(x_i), h_{1\text{-}4}(x_i), h_{1\text{-}5}(x_i)\}$;

(3) The output $H_1$ of the first-layer learners is spliced with the original input grain training data $D$ for $i = 1$ to $m$ to obtain $D_h = \{x'_i, y_i\}$;

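(4) Train the second-layer learners $T_2 = \{t_{x\text{-}y\text{-}z} \mid t_{2\text{-}1\text{-}1}, t_{2\text{-}1\text{-}2}, t_{2\text{-}1\text{-}3}, \dots, t_{2\text{-}5\text{-}4}, t_{2\text{-}5\text{-}5}\}$, where x is the integration layer number, y is the learner type and z is the learner layer number, and obtain, in the same way as in step (2), the second-layer training result $H_2 = \{h_{2\text{-}1\text{-}1}(x_i), h_{2\text{-}1\text{-}2}(x_i), h_{2\text{-}1\text{-}3}(x_i), \dots, h_{2\text{-}5\text{-}4}(x_i), h_{2\text{-}5\text{-}5}(x_i)\}$;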

(5) The fusion layer learner $T$ is trained on $H_2$ and returns h = {I, II, III, IV, V, VI, VII, VIII}.
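The data flow of steps (1) to (4), in particular the splicing of first-layer outputs onto the original features, can be sketched as follows. This is only a toy illustration under stated assumptions: `MajorityLearner` and `ThresholdLearner` are hypothetical stand-ins for the expert model and tree-based learners named in the text, the second layer here simply reuses the same learner types, and the fusion step (5) is not shown.

```python
from collections import Counter


class MajorityLearner:
    """Toy stand-in learner: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]


class ThresholdLearner:
    """Toy stand-in learner: splits one feature at its training mean."""

    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y):
        values = [row[self.feature] for row in X]
        self.cut = sum(values) / len(values)
        above = [lab for row, lab in zip(X, y) if row[self.feature] > self.cut]
        below = [lab for row, lab in zip(X, y) if row[self.feature] <= self.cut]
        fallback = Counter(y).most_common(1)[0][0]
        self.hi = Counter(above).most_common(1)[0][0] if above else fallback
        self.lo = Counter(below).most_common(1)[0][0] if below else fallback
        return self

    def predict(self, X):
        return [self.hi if row[self.feature] > self.cut else self.lo for row in X]


def splice(X, layer_outputs):
    """Step (3): append every learner's prediction to the original features."""
    return [list(row) + [out[i] for out in layer_outputs] for i, row in enumerate(X)]


def fit_two_layers(X, y, make_learners):
    """Steps (1)-(4): train layer 1, splice its outputs, train layer 2."""
    layer1 = [m.fit(X, y) for m in make_learners()]
    H1 = [m.predict(X) for m in layer1]          # first-layer results
    X_h = splice(X, H1)                          # spliced training data
    layer2 = [m.fit(X_h, y) for m in make_learners()]
    H2 = [m.predict(X_h) for m in layer2]        # second-layer results
    return layer1, layer2, H2
```

The second layer sees both the raw features and the first layer's predictions, which is what allows it to correct the first layer's errors rather than merely average them.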

S105, fusing the plurality of sub-model learning results through Gaussian mixture to obtain a grain and oil crop supply chain hazard identification model, and identifying the grain and oil crop supply chain hazard according to the grain and oil crop supply chain hazard identification model.

Specifically, on top of the multiple classifiers, the fusion module is designed to combine the sub-models strongly into a more powerful representation. Common fusion methods such as mean voting and linear mixing rely on good classification labels and considerable manual parameter tuning, and are therefore not well suited to the hazard monitoring data of the grain and oil supply chain. This embodiment therefore selects the Gaussian mixture method to fuse the results of the multiple sub-models. The Gaussian distribution underlying the mixture is a distribution model widely used in mathematical statistics, with probability density function $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, where $\mu$ and $\sigma^2$ are the mean and variance of the Gaussian distribution, respectively. When the input consists of several Gaussian distribution probabilities, a multimodal probability distribution can be obtained as a linear combination of the different distributions, which achieves effective fusion of multiple models.

Based on the multiple primary learners trained in S104, the embodiment gives the Gaussian mixture probability of a single sub-model probability variable $x$ as $p(x) = \sum_{i=1}^{K} \phi_i \, N(x \mid \mu_i, \Sigma_i)$, where K is the number of components, i is the component index, and $\phi_i > 0$ is the mixing weight of the i-th component, which reflects the importance of the corresponding Gaussian distribution in the overall model. The weights satisfy the constraint $\sum_{i=1}^{K} \phi_i = 1$, so that the overall probability distribution is normalized. $N(x \mid \mu_i, \Sigma_i)$ is the multivariate Gaussian probability density of the i-th component, expressed as follows:

$$N(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_i)^{\mathsf{T}}\, \Sigma_i^{-1}\, (x-\mu_i)\right)$$

where d is the dimension of $x$, and $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix, respectively, which can be jointly represented by the latent variable $\theta_i = (\mu_i, \Sigma_i)$. The combination of the component distributions is defined as linear, the maximum likelihood estimate of the latent variables is redefined to obtain the likelihood function, which is optimized by an iterative stepwise maximization method, and the final fused output is then calculated. In essence, Gaussian fusion combines several single Gaussian models, making the model more complex and thus able to generate more complex samples; the principle is shown schematically in fig. 5. When the weights are set appropriately, samples of arbitrary distribution can be fitted. The Gaussian fusion method is therefore used as the second fusion layer of the whole hazard identification model; it has a marked effect on accurate identification and risk classification of grain and oil crop supply chain hazards, effectively reduces the influence of unbalanced multi-source heterogeneous data distribution and incomplete multi-dimensional feature information on the identification system and method, and improves model efficiency and stability.
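As a minimal sketch of the weighted linear combination of Gaussian components, the snippet below evaluates a mixture density and the posterior weight of each component for a scalar input. It is a univariate simplification of the multivariate case in the text; the function names are assumptions, and how sub-model outputs are mapped to the variable x is not specified in the source and is not modelled here.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Density of a univariate Gaussian with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def mixture_pdf(x, weights, means, variances):
    """Weighted sum of component densities; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def responsibilities(x, weights, means, variances):
    """Posterior probability that x was generated by each component."""
    parts = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(parts)
    return [p / total for p in parts]
```

The `responsibilities` function corresponds to the expectation step of the usual iterative fitting procedure for such mixtures; the maximization step, which re-estimates the weights, means and variances, is omitted for brevity.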

Example 2

Corresponding to embodiment 1, this embodiment proposes a grain and oil crop supply chain hazard recognition system based on a deep ensemble learning network, the system comprising a processor configured with processor-executable operation instructions to perform the following operations:

The specific principle and calculation process of the system proposed in this embodiment may refer to the content described in embodiment 1, and will not be described herein.

By relying on the hazard monitoring data of every link of the grain and oil crop supply chain and applying the deep integrated learning algorithm, the method mines the interrelationships among the links, regions and time-varying factors of the grain and oil crop supply chain, reduces the influence of unbalanced multi-source heterogeneous data distribution and incomplete multi-dimensional feature information on the identification system and method, and improves the accuracy of hazard identification and the reliability of grain and oil supply chain risk assessment. The method according to the application is described in more detail below with reference to examples.

First, the data used in the experiment are described. The grain data set describes 34170 examples with 87 features; the first 65 features are coefficient items obtained by encoding string-type data, and the last 21 features are numerical features such as temperature, humidity and detection values. The purpose of the experiment is to divide the examples into eight classes (safety level, safer level, early warning level, lower risk level, medium risk level, higher risk level, high risk level, ultra-high risk level), which is a multi-class classification problem. The distribution of classes in the data set is: safety level (10880), safer level (1752), early warning level (1288), lower risk level (2117), medium risk level (2175), higher risk level (1226), high risk level (1086), ultra-high risk level (5760). When the grain data set is used, the hazard detection value feature is removed to prevent the model from concentrating on that feature and neglecting the others; moreover, in the grain early warning model, new data arrive without a detection value, since otherwise early warning would lose its meaning.

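To judge the classification results, the experiment uses only F1-Score as the evaluation index, with the formulas as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, FN the number of false negatives, Precision is the precision and Recall is the recall rate.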

Thus, in the classification task, a precision of 1.00 for category I means that every item labelled as belonging to category I does belong to category I (but says nothing about the category I items that were labelled incorrectly), while a recall of 1.00 means that every item in category I was labelled as belonging to category I (but says nothing about how many other items were incorrectly labelled as category I). In general, precision and recall are not discussed separately. F1-Score ranges from 0 to 1, where 1 represents the best model output and 0 the worst.
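The standard definitions of precision, recall and F1-Score can be checked with a few lines of code. The function name and the counts in the usage note are illustrative assumptions, not values taken from the experiment.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive
    and false-negative counts, using 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, with hypothetical counts of 8 true positives, 2 false positives and no false negatives, recall is perfect while precision is 0.8, and F1-Score sits between the two, closer to the lower value.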

The deep integrated learning method and several traditional classification algorithms, including LR, KNN, SVM, Extra Trees (ET), Gradient Boosting (GB), Random Forest (RF) and Decision Tree (DT), were applied in experiments on grain products. All models in the experiment were trained and tested on an Intel Core i 7.6 GHz processor with four NVIDIA TESLA P GPUs and 256G RAM. The experiments used the grain data collected as described above. Training is performed on the training set, and evaluation is then performed on a validation set to minimize overfitting; once the best choice of training process and parameters is reached, a final evaluation is performed on an unseen test set, and the reliability of the different algorithms is assessed by analyzing performance indicators such as accuracy, precision and recall. Table 2 gives the accuracy of each model for risk assessment under 5-fold cross-validation.

TABLE 2

The results show that, apart from the method proposed by the present application, ET, RF, GB and DT perform better than the other algorithms and achieve better accuracy from a practical point of view. Using the ensemble learning concept proposed by the present application, the four best-performing algorithms (ET, RF, GB and DT) are therefore combined for the grain data classification problem through a learner fusion strategy. The final 5-fold cross-validation achieved an ensemble accuracy of 98.54%. In addition, the experiments tested a three-layer ensemble of the four best-performing classifiers, but its results did not differ significantly in practice from those of the two-layer ensemble. In scientific research it is useful to distinguish statistical significance from practical significance: the three-layer ensemble reached an accuracy of 98.65% under 5-fold cross-validation, only 0.11 percentage points above the two-layer result, a difference that is not significant in practice. Thus, the present application adopts the two-layer ensemble of the four best-performing algorithms.

Experiments show that the application gives the best results for risk assessment of the food supply chain on 7886 new examples, and that the model can be used to classify new examples. Specifically, the precision, recall and F1-Score of the risk assessor of the present application for each category are presented in Table 3.

TABLE 3

The recall of the safety level (I) was 1.00, which means that 100% of the safety level instances in the test set were assigned to the safety level category. The precision and F1-Score of the safety level (I) were also 1.00. For the safer level (II), the precision, recall and F1-Score were all 0.98. For the early warning level (III), the precision was 0.97, which indicates that some instances were wrongly assigned to that class. Analysis of these erroneous instances shows that they belong to the lower risk level (IV), but the classifier assigned them to the early warning level (III). The recall for the early warning level (III) was 0.97, indicating that 97% of the early warning instances in the test set were assigned to that class. The evaluation results for the lower risk level (IV), medium risk level (V), higher risk level (VI) and high risk level (VII) are similar to those of the early warning level (III), and the results for the ultra-high risk level (VIII) are similar to those of the safety level (I).

The accuracy of the classification model is measured with a confusion matrix, which summarizes the model's decisions and forms part of model evaluation. The confusion matrix drawn from the classification results of the deep integrated learning model proposed by the application is shown in fig. 6. The results show that the lower risk level (IV) and the medium risk level (V) are confused most often: most incorrectly classified instances are assigned to one of these two categories, which affects the precision of both. Table 4 gives examples of 8 randomly selected, correctly classified foods, two for each food category.
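How per-class errors are read off a confusion matrix can be sketched as follows. The orientation (rows are true classes, columns are predicted classes) is an assumption, since the layout of fig. 6 is not described in the text, and the function names are introduced here for illustration only.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix with rows = true class and columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_counts(matrix, k):
    """Return (TP, FP, FN) for the class at row/column k."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # class-k items predicted as another class
    fp = sum(row[k] for row in matrix) - tp   # other classes predicted as class k
    return tp, fp, fn
```

Confusion between two classes such as IV and V shows up as large off-diagonal counts in the two cells linking them, which raises FP for one class and FN for the other.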

TABLE 4

The results show that the accuracy of grain classification can be improved by applying the deep integrated learning algorithm, the grain states can be classified according to various indexes in the whole grain industry chain, and grain safety early warning is realized.

Meanwhile, the experimental data show that the risk levels of hazards differ across supply chain links, and that the degree of risk can gradually increase over time or accumulate as the supply chain progresses. According to the results, level I products, i.e. qualified products, account for the vast majority in the grain industry chain, while level VIII products account for a relatively large proportion in the production and circulation links, where risks arise easily and accumulate over a wide range. FIG. 7 shows the distribution of risk levels in the various links of the supply chain.

The distribution of risk levels also differs across hazard items, as shown in fig. 8. For hazard items other than aluminum, cadmium, chromium and deoxynivalenol, levels I/II/III account for the vast majority. Although the risk level of these items appears high, the data selected for the experiment were randomly extracted from the acquired data, so the food risk on the market is far lower than these results suggest. The distribution of the four main pollutants across the production, circulation, sales and purchase links according to the experimental results is shown in fig. 9.

In conclusion, according to experimental results of the implementation cases, the accuracy of grain classification can be improved through the model obtained by the Internet of things system and the improved deep integrated learning algorithm, the grain state can be classified according to various indexes in the whole grain industry chain, and grain safety early warning is achieved.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A method for identifying a grain and oil crop supply chain hazard based on a deep ensemble learning network, the method comprising:

Fusing the plurality of sub-model learning results through Gaussian mixture to obtain a grain and oil crop supply chain hazard identification model, and identifying the grain and oil crop supply chain hazard according to the grain and oil crop supply chain hazard identification model;

The deep integrated learning network consists of two layers of primary learners, wherein the first layer performs preliminary training on the input training data through five learners, the random forest and the gradient boosting tree being typical representatives of the bagging and boosting approaches in ensemble learning, respectively; the output of the first layer is then spliced with the training data input to the first layer to form the training data of the second layer, and the second layer stacks the five learners of the first layer to train on this data, summarizing and correcting the results of the learning algorithms of the previous layer; finally, a fusion method is selected to obtain the final output, so that, with the training data $D$ as input and the integrated classification h = {I, II, III, IV, V, VI, VII, VIII} as output, the learning process is as follows:

(1) Inputting the grain training data $D$, whose features are $A = \{A_j\}$, $j = 1, 2, \dots, n$, and training the first-layer learners $T_1 = \{t_{x\text{-}y} \mid t_{1\text{-}1}, t_{1\text{-}2}, t_{1\text{-}3}, t_{1\text{-}4}, t_{1\text{-}5}\}$, wherein x is the integration layer number and y is the learner type;

(2) Obtaining $h_{1\text{-}1}(x_i)$ through the expert risk assessment model $t_{1\text{-}1}$; for the decision tree $t_{1\text{-}2}$, calculating the information gain ratio of each feature over all examples in $D$ and selecting the feature $A_g$ with the maximum information gain ratio; if that ratio is smaller than a threshold $\varepsilon$, constructing $t_{1\text{-}2}$ by taking the class $H_k$ with the most examples in $D$ as the class of the node; otherwise dividing $D$ into several non-empty subsets $D_i$ according to each possible value of $A_g$, and adding child nodes labelled with the class having the most examples in $D_i$ to $t_{1\text{-}2}$, to obtain $h_{1\text{-}2}(x_i)$; for the random forest $t_{1\text{-}3}$, constructing m sub-data sets $D_i$ by random sampling, randomly selecting n features $A_j$ from $A$ for each $D_i$, finding the optimal split point as in decision tree construction, repeating this to obtain the decision trees $t_{1\text{-}3\text{-}k}$, obtaining a result from each sub-tree, and obtaining $h_{1\text{-}3}(x_i)$ by majority voting; for the extremely randomized tree $t_{1\text{-}4}$, selecting the split attribute of each decision tree $t_{1\text{-}4\text{-}k}$ completely at random, and training to obtain $h_{1\text{-}4}(x_i)$; initializing the gradient boosting tree $t_{1\text{-}5}$ as $f_0(x) = \arg\min_c \sum_{i=1}^{N} L(y_i, c)$, where $L$ denotes the loss function, iterating over the grain data for $m = 1, 2, \dots, M$ rounds, calculating the negative gradient of the loss function as the residual $r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}$, fitting a tree to $\{(x_1, r_{m1}), \dots, (x_N, r_{mN})\}$ to obtain the leaf node regions $R_{mj}$, $j = 1, 2, \dots, J$, of the m-th tree, estimating the value of each leaf node region that minimizes the loss function, $c_{mj} = \arg\min_c \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + c)$, updating $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})$, and obtaining the final gradient boosting tree $\hat{f}(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})$, which gives $h_{1\text{-}5}(x_i)$; the first-layer training result being $H_1 = \{h_{1\text{-}1}(x_i), h_{1\text{-}2}(x_i), h_{1\text{-}3}(x_i), h_{1\text{-}4}(x_i), h_{1\text{-}5}(x_i)\}$;

(3) Splicing the output $H_1$ of the first-layer learners with the original input grain training data $D$ for $i = 1$ to $m$ to obtain $D_h = \{x'_i, y_i\}$;

2. The method of claim 1, wherein vector encoding and standard normalization preprocessing the attribute features in the multi-source heterogeneous data set to obtain the multi-source heterogeneous data set including normalized data comprises:

3. The method of claim 1, wherein the step of multi-granularity filtering the multi-source heterogeneous data set comprising standardized data to scan out scan results comprises:

4. A method according to claim 3, wherein the step of K-fold cross-validating the scan results to obtain training data comprises:

5. A deep ensemble learning network based grain and oil crop supply chain hazard identification system, comprising a processor configured with processor-executable operating instructions to perform the method of claim 1.

6. The system of claim 5, wherein the processor is configured with processor-executable operating instructions to perform operations comprising:

7. The system of claim 5, wherein the processor is configured with processor-executable operating instructions to perform operations comprising:

8. The system of claim 7, wherein the processor is configured with processor-executable operating instructions to perform operations comprising: