CN115062969A - Early warning method for food safety risk


Info

Publication number
CN115062969A
Authority
CN
China
Prior art keywords
sample
risk
samples
data
food safety
Prior art date
Legal status
Pending
Application number
CN202210676050.6A
Other languages
Chinese (zh)
Inventor
吕小毅
左恩光
陈晨
陈程
Current Assignee
Xinjiang University
Original Assignee
Xinjiang University
Priority date
Filing date
Publication date
Application filed by Xinjiang University
Priority to CN202210676050.6A
Publication of CN115062969A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities

Landscapes

  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Educational Administration (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a food safety risk early warning method comprising the following steps: (1) the detection data of the detection samples are normalized, and a structured (graph) representation is then composed according to the relevance among the detection samples; (2) the sample data are sampled to obtain instance pairs; (3) a GCN-based contrastive learning model is established and trained, the consistency of the instance pairs is identified, and sample risk assessment is carried out. The early warning method, referred to as the CSGNN framework, mines the structural information and topological correlation information of the detection data more fully, completes early warning and control of food safety risk more efficiently, and is of positive significance for avoiding food safety accidents.

Description

Early warning method for food safety risk
Technical Field
The invention belongs to the field of food detection, and particularly relates to a food safety risk early warning method.
Background
Food safety is of growing concern to international organizations and to people around the world, and a scientific, efficient supervision and early warning scheme can effectively reduce the probability of food safety accidents. At present, many international organizations and countries have established monitoring systems to guarantee food safety and quality. Likewise, China has gradually improved its national food safety risk assessment system. For example, in 2009 China promulgated the Food Safety Law of the People's Republic of China. In 2011, the China Food Safety Risk Assessment Center (CFSA) was established. In 2018, the revised edition of the Food Safety Law took food safety risk assessment as the scientific basis for implementing supervision and formulating standards. Therefore, developing food safety risk assessment methods helps to systematize and standardize the Chinese food safety supervision system.
Existing mainstream risk early warning methods include methods based on hierarchical relationship analysis, methods based on Bayesian networks, and methods based on Artificial Neural Networks (ANN). However, these methods have the following disadvantages:
(1) They rely on supervised learning. Manually labeling the detection data greatly increases the time cost and requires the operator to know the data categories clearly; once an accidental error is made in the category assignment, all subsequent tasks are affected by this subjective interference, which is fatal in practical application scenarios. The supervised learning process on the raw data is shown in FIG. 1(a).
(2) They use balanced training data or do not take class imbalance in the training data into account. Class imbalance means that the sample sizes of different labels in the data differ significantly, which is common in practical scenarios. Class imbalance limits model performance to different degrees, so it is critical to study how to handle class-imbalanced data while maintaining relatively good performance.
(3) They do not adequately capture the topological information between detection samples. Data obtained in the detection process are complex, nonlinear and discrete, which means that both the attribute information and the topological information of the detection data must be attended to as closely as possible in order to realize more accurate early warning of food safety risks.
Contrastive learning is a promising way to overcome the above limitations. Contrastive learning constructs supervision information from the data itself in a self-supervised manner, which essentially removes the dependence on manual labels; the processing flow is shown in FIG. 1(b). By modeling the relationship between each node and part of its neighboring substructures, contrastive learning focuses on learning features shared by similar instances and on distinguishing differences between dissimilar instances, and it has shown strong advantages in graph representation learning, especially in anomaly detection on attributed networks. The embeddings learned on an attributed network contain both attribute and structural information and can effectively capture the topology and attribute information of the network; FIG. 2 shows the three kinds of anomalies that attributed-network methods aim to capture. The early warning task for food safety risk aims to mine all unqualified samples and the qualified samples with potential risks, that is, to find abnormal samples whose feature information differs from that of the majority of qualified samples, which is similar in principle to anomaly detection on attributed networks. Graph Neural Networks (GNN) model the complex associations between individual samples, so contrastive learning based on attributed networks has potential application to the food safety risk early warning task.
Based on the above, the invention provides a novel food safety risk early warning model and a method for establishing it: a Contrastive Self-supervised learning-based Graph Neural Network framework (abbreviated as CSGNN), which can be used for early warning and control of food safety risks.
Disclosure of Invention
The invention aims to provide a method for establishing a food safety risk early warning model which, based on the CSGNN framework, can more fully mine the structural information and topological correlation information of detection data, complete early warning and control of food safety risk more efficiently, and is of positive significance for avoiding food safety accidents.
In order to realize the purpose, the adopted technical scheme is as follows:
a food safety risk early warning method comprises the following steps:
(1) after normalization processing is carried out on detection data of detection samples, composition of structured representation is carried out according to the relevance among the detection samples;
(2) sampling the sample data to obtain an example pair;
(3) and after a GCN-based contrast learning model is established and contrast learning is carried out, the consistency of the example pairs is identified, and sample risk assessment is carried out.
Further, in step (2), the detection data are normalized according to their classification into forward indexes, reverse indexes and oscillation indexes, respectively, by the following formulas:

x'_i = (x_i - min(x)) / (max(x) - min(x))   (forward indexes)

x'_i = (max(x) - x_i) / (max(x) - min(x))   (reverse indexes)

x'_i = d(x_i, [q_1, q_2]) / max_j d(x_j, [q_1, q_2])   (oscillation indexes)

wherein x_i is the original value of the index for sample i, min(x) and max(x) are the minimum and maximum values of the index over all samples, [q_1, q_2] is the specified optimal interval of an oscillation index, and d(x_i, [q_1, q_2]) is the distance from x_i to that interval (zero when x_i lies inside it).
moreover, the sampling process in the step (2) sequentially comprises the following steps: determining pairs of sampled samples, sampled neighboring sample groups, hidden sampled samples and synthesized instances.
In addition, in step (3), a classifier is adopted to identify the consistency of the instance pairs: the closer the prediction scores of the positive and negative instance pairs are to the middle value, the lower the risk; conversely, the higher the risk.
In step (3), after contrastive learning is performed, the prediction score is calculated by the following formula:

p_i = CLM(s_i, G_i)

wherein CLM(·) represents the contrastive learning model, s_i is the sampling sample, and G_i is the adjacent sample group.

Finally, sample risk evaluation is carried out. The risk value of a sample is obtained by averaging, over multiple sampling rounds, the absolute value of the difference between the prediction scores of its positive and negative instance pairs:

f(s_i) = (1/R) Σ_{r=1}^{R} | p_i^+(r) - p_i^-(r) |

wherein f(s_i) is the risk value of the sample and R is the number of sampling rounds.
Further, in step (3), the contrastive learning model comprises a GCN module, a dimension reduction module and an embedding comparison identification module.
In addition, the dimension reduction module adopts the following formula:

e_i = (1/n_i) Σ_{m=1}^{n_i} (E_i)_m

wherein e_i is the low-dimensional embedding feature, (E_i)_m is the m-th row of the adjacent sample group embedding E_i, and n_i is the number of samples in the adjacent sample group G_i.
The embedding comparison identification module adopts the following formula:

p_i = σ( e_i W^(b) h_i^T )

wherein W^(b) is the weight matrix of the comparison identification module, σ(·) represents the sigmoid function, and h_i is the embedding of the sampling sample.
Further, in step (3), the contrastive learning model adopts a binary cross entropy loss function, together with balanced sampling of positive and negative instance pairs, to cope with the class imbalance problem in the food detection task.
Compared with the prior art, the invention has the beneficial effects that:
according to the technical scheme, food detection data are constructed into an attribute graph to enable the attribute graph to simultaneously contain attribute information and structure information; and then the training self-supervision contrast learning module is trained through positive and negative examples obtained by sampling the complete attribute graph. Therefore, the CSGNN can more fully mine the structural information and the contrast associated information of the detection data, and more efficiently finish early warning and control on the risk of food safety. Briefly, the food safety risk early warning framework provided by the invention has the main advantages that:
1. The invention provides an end-to-end food safety risk early warning and control framework, which can efficiently detect unqualified samples and grade the risk of qualified samples in food detection data.
2. The invention provides a contrastive self-supervised learning scheme for early warning of food safety risk. Contrastive learning essentially removes the dependence of existing methods on balanced data categories, and self-supervised learning effectively overcomes the problems that manual labeling of data categories is easily disturbed by subjective factors and that its time cost is too high in practical applications.
3. The invention adopts GNN to propagate information and, by constructing an attribute graph, comprehensively considers the attribute information and structural information of the data nodes. To our knowledge, this is also the first application of a graph algorithm to the food safety risk early warning task.
4. Data from a practical scenario verify that the early warning effect of the algorithm is superior to that of current mainstream models. The CSGNN framework and the mainstream models were compared on the detection data of a batch of dairy products from a certain province in China: the unqualified-sample recall of CSGNN reaches 1.0000, more than 13% higher than the suboptimal model, the precision value and qualified-sample precision value reach 0.9829 and 1.0000 respectively, and the food detection data are divided into risk grades according to the risk value of each sample.
Drawings
FIG. 1 is a comparison of supervised learning and self-supervised learning, where (a) is supervised learning and (b) is self-supervised learning;
FIG. 2 illustrates the three kinds of anomalies captured in an attributed network: subgraph (a) is a structural anomaly, i.e., there are connected nodes whose attributes do not all match; subgraph (b) is an attribute anomaly, i.e., there are attributes that do not match those of any connected node; subgraph (c) is a combined anomaly, i.e., structural and attribute anomalies both exist;
FIG. 3 is the overall framework diagram of CSGNN;
FIG. 4 is a three-dimensional visualization of part of the original food detection sample data;
FIG. 5 is a visualization of 100 randomly selected samples, where (a) shows the dimensionality-reduced distribution of the sterilized milk detection data before preprocessing, and (b) visualizes the graph structure of the constructed graph; node diameter is proportional to the PageRank score, node color is determined by the sample category, and edge color is determined by the source node color;
FIG. 6 shows the instance pair sampling process in contrastive learning, where node No. 4 is the sampling sample and node No. 9 is an unqualified sample;
FIG. 7 is the risk ranking produced by the CSGNN framework on the sterilized dairy dataset.
Detailed Description
To further illustrate how the food safety risk early warning method of the present invention achieves its intended purpose, the specific implementation, structure, features and effects of the method are described in detail below with reference to preferred embodiments. In the following description, different references to "one embodiment" or "an embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Before describing the early warning method of food safety risk in detail, it is necessary to further describe the related materials mentioned in the present invention to achieve better effect.
1. Food safety risk assessment model
There are many kinds of traditional food safety risk assessment models, and different assessment methods show different performance. In particular, Back Propagation (BP) neural networks can fall into a local optimum during training and thus fail to train; a Support Vector Machine (SVM) can guarantee finding a global optimum by means of convex optimization and has been applied to various detection tasks, but it cannot fully mine the potential risks in food safety data; Bayesian network models have also been applied to food safety assessment tasks, but they usually need to be modeled in combination with expert knowledge, so their performance is limited by expert experience.
With the development of deep learning, deep neural network models (DNN) have shown promising potential for mining data features and provide a new idea for food safety risk assessment. Nogales et al. applied a multilayer perceptron (MLP) and a one-dimensional convolutional neural network (Conv1D), combined with entity embedding, to a food risk prediction workflow on the European Union RASFF data and obtained better prediction accuracy than machine learning models. Geng et al. evaluated the risk of food inspection data using a Deep Radial Basis Function (DRBF) neural network combined with the Analytic Hierarchy Process (AHP), enhancing the data representation of shallow RBF networks while avoiding local optima. However, these models are strongly affected by subjective indexes and cannot sufficiently capture the correlation information between complex detection data. In contrast, the present invention proposes for the first time the use of GNNs to assess food safety risks. GNNs model individuals and the correlations among them, can adapt to the interrelations of complex data, and have potential for mining the latent feature relations of food detection data.
2. Contrastive learning
Contrastive learning is an important branch of self-supervised learning. It completes the representation of data features by constructing instance pairs and feeding them into a contrastive learning module; the contrast objects and the contrastive loss respectively highlight the inconsistency between dissimilar data and the similarity between similar data, which matches the original goals of downstream tasks such as classification and detection.
As GNNs evolve, contrastive learning has also been applied to their training. DGI captures global structural information in the network by maximizing the mutual information between local and global inputs. GraphCL learns node embeddings by maximizing the representation similarity between the intrinsic features and the link structure of the local subgraph of the same node. SUBLIME maximizes the mutual information between the anchor graph and the learned structure graph through a contrastive loss. Due to the particularity of the risk assessment task, the invention uses the contrast module to focus on local information of the data rather than global information, which helps the model mine the data characteristics of risk samples more efficiently.
After understanding the related materials mentioned in the present invention, the following will describe a food safety risk warning method in further detail with reference to specific embodiments:
the early warning and control of the effective food safety risk can obviously reduce the possibility of food safety accidents. The existing food safety risk early warning method relies on supervised learning, complex characteristic association among detection samples is not modeled, and the problem of unbalanced detection data types is not considered. In order to overcome the limitations, the invention provides a contrast-based self-supervision-based graph neural network framework (abbreviated as CSGNN) applied to early warning of food safety risks. Specifically, firstly, the relevance between detection samples is constructed, then a comparison learning positive and negative example pair based on an attribute network is defined, then a complex relation between the detection data samples is captured in a self-supervision mode, and finally the risk level of each sample is evaluated by using the absolute value of the difference value of the prediction scores of multiple rounds of positive and negative example pairs generated by CSGNN. In addition, the embodiment of the invention performs sample research on the detection data of a batch of dairy products in China, and the experimental result shows that the risk assessment performance of CSGNN on food data is superior to that of other baseline models, the AUC value and the unqualified sample-Recall ratio (Recall of unqualified samples) respectively reach 0.9188 and 1.0000, and the interpretable risk grade division can be provided for the food detection data. The research result can provide scientific guidance basis for related departments to develop early warning work, thereby reducing food risks.
Example 1.
Materials and methods
In this context, we use bold lowercase letters (e.g., x) to represent vectors, bold uppercase letters (e.g., X) to represent matrices, and calligraphic letters (e.g., G) to represent sets. All symbols used herein are listed in Table 1.
Table 1: notation used in the CSGNN framework. The three blocks of the table (from top to bottom) respectively list the variable notation for data preprocessing and structured representation, for graph convolutional neural network (GCN) based contrastive learning, and the hyper-parameters of CSGNN.
(I) Problem definition and data sources
① Problem definition
Given a group of food detection data consisting of N samples, each with V detection indexes, an attribute graph G = (𝒱, ℰ, X) is first constructed for the detection data, where 𝒱 = {v_1, ..., v_N} is the node set of G, ℰ is the edge set of G, and X ∈ R^{N×d} is the attribute matrix of G (d = V). The goal is to calculate a risk value for each sample s_i (a higher risk value means a higher risk that the sample presents a problem). The risk values of all samples are then sorted, and the detection data are graded by risk according to the lowest risk value W among the unqualified samples and the more obvious boundary value U between the risk values of risk samples and safety samples.
② Data sources
This embodiment uses raw data collected by the food inspection agency of Guizhou Province, China, taking 2158 sterilized dairy product test records as the study object. To observe the model's ability to detect class-imbalanced data, the ratio of qualified samples to unqualified samples in the study data is 2117:41. According to the Chinese national food safety standard, the inspection indexes of sterilized dairy products comprise five categories: sensory indexes, physicochemical indexes, contaminant indexes, mycotoxin indexes and microbial indexes. The selection of food safety risk evaluation indexes should take operability, effectiveness and other factors into account.
This embodiment selects, from the obtained detection data, the evaluation indexes that may cause food safety risk in a scientific way. Since the microbial indexes in the detection data all meet the requirements, this embodiment selects from the sterilized milk test indexes specified by the national standard five physicochemical indexes (lactose, nonfat milk solids, protein, acidity and fat) together with the mycotoxin index, which refers to aflatoxin M1, as the evaluation criteria of food safety risk. The specific requirements and test methods for the six evaluation indexes are shown in Table 2.
TABLE 2 concrete requirements of six evaluation indexes and test method
Some samples of the sterilized dairy product assay data used in this example are shown in table 3.
TABLE 3 partial sample data of sterilized Dairy product inspection data
(II) Food safety risk early warning based on contrastive self-supervised learning
In this section, the overall framework of CSGNN is described in detail, as shown in FIG. 3. The CSGNN framework is composed of four parts: data preprocessing and structured representation, contrastive instance pair sampling, GCN-based contrastive learning, and sample risk assessment.
① Data preprocessing and structured representation
Part of the sample data of the food detection data in Table 3 is visualized in three dimensions, as shown in FIG. 4 (the test dates in the figure are 2021.10.10, 2020.04.10, 2019.05.04, 2018.06.10 and 2021.09.09 in order). The figure shows that there are significant differences between the different risk assessment indexes.
This embodiment converts the original data into unitless data by min-max normalization in order to eliminate the dimensional differences between the indexes. According to the different requirements of the food safety standard on the six evaluation indexes, they are divided into three types: forward indexes, reverse indexes and oscillation indexes; the specific classification is shown in Table 4. A forward index is an index whose risk increases as its value increases; a reverse index is an index whose risk decreases as its value increases; an oscillation index is an index whose risk is smaller the closer its value is to a specified interval and larger the farther it is from that interval. The three types of indexes are normalized by formulas (1) to (3), respectively. The larger a normalized value is, the greater its risk.
TABLE 4: division of sterilized dairy product risk assessment index types

x'_i = (x_i - min(x)) / (max(x) - min(x))      (1)

x'_i = (max(x) - x_i) / (max(x) - min(x))      (2)

x'_i = d(x_i, [q_1, q_2]) / max_j d(x_j, [q_1, q_2])      (3)

wherein x_i is the original value of an index for sample i, min(x) and max(x) are the minimum and maximum values of that index over all samples, [q_1, q_2] is the specified optimal interval of an oscillation index, and d(x_i, [q_1, q_2]) is the distance from x_i to that interval (zero when x_i lies inside it).
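The normalization step can be summarized with a short sketch. This is an illustrative reading of formulas (1) to (3), not code from the patent; the forward and reverse forms are standard min-max scaling, and the oscillation form (distance to an assumed optimal interval [q1, q2]) is our interpretation of the definition above.

```python
import numpy as np

def normalize_forward(x):
    # forward index: risk grows as the raw value grows
    return (x - x.min()) / (x.max() - x.min())

def normalize_reverse(x):
    # reverse index: risk shrinks as the raw value grows
    return (x.max() - x) / (x.max() - x.min())

def normalize_oscillation(x, q1, q2):
    # oscillation index: risk grows with the distance to the interval [q1, q2]
    d = np.maximum(np.maximum(q1 - x, x - q2), 0.0)
    return d / d.max() if d.max() > 0 else np.zeros_like(d)
```

After normalization, every index lies in [0, 1] and a larger value means a higher risk, which is the convention assumed by the rest of the framework.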
the embodiment is structurally represented according to the relevance among the detection samples. Specifically, the present embodiment represents the food detection samples as nodes in a graph, and represents each detection index of the samples as a node attribute in the graph, thereby constructing an attribute graph. When the sterilized milk detection data is composed, the distances between the samples are calculated respectively and are arranged in a descending order, and experiments in the embodiment show that the model has the best comprehensive performance when the front Z closest to the samples is 50 samples with edges and the rest samples are not with edges.
The present embodiment processes raw complex and discrete detection data into a structured representation suitable for GNNs in a pre-processing and patterning manner. In order to show the preprocessing and composition effects of the original detection data more clearly, 100 samples in the data set are randomly selected for visualization, and (a) and (b) in fig. 5 show the dimensionality reduction distribution before the preprocessing of the sterilized milk data and the network structure diagram after the preprocessing and composition, respectively.
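A minimal sketch of this composition step is given below, assuming Euclidean distance on the normalized attribute vectors; the function name and the symmetrization of the adjacency matrix are our own choices.

```python
import numpy as np

def build_attribute_graph(X, Z=50):
    """X: (N, d) matrix of normalized detection data. Returns an adjacency matrix."""
    N = X.shape[0]
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(dists, np.inf)            # no self-edges
    A = np.zeros((N, N), dtype=np.float32)
    for i in range(N):
        nearest = np.argsort(dists[i])[:Z]     # edges only to the Z closest samples
        A[i, nearest] = 1.0
    return np.maximum(A, A.T)                  # treat the graph as undirected
```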
② Contrastive instance pair sampling
The definition of contrastive instance pairs is the core of a contrastive learning framework, and previous work has demonstrated the different advantages of various instance pair definitions on graphs. Because complex topological relations exist between different samples of food detection data, the food safety risk assessment framework is expected to capture both the attribute information and the structural information of the samples.
This embodiment focuses on modeling the relationship between a target node and its neighboring subgraph to help mine the local information of the node. Specifically, the CSGNN framework uses "sampling sample vs. adjacent sample group" instance pairs on the attributed network. The first element of an instance pair is an arbitrary sample obtained from one traversal of the detection data, and the second element is a group of adjacent samples sampled from an initial sample. For a positive instance pair, the initial sample is the sampling sample itself, i.e., the adjacent sample group matches the neighborhood of the sampling sample; for a negative instance pair, the initial sample is drawn at random from all samples excluding the sampling sample, i.e., the adjacent sample group does not come from the sampling sample. Therefore, for a risk sample there is a certain degree of mismatch between the sampling sample and its adjacent sample group, and a higher degree of mismatch represents a higher risk for the detection sample corresponding to that node.
The instance pair sampling procedure of the CSGNN framework is shown in FIG. 6; it includes four parts: determining the sampling sample, sampling the adjacent sample group, hiding the sampling sample, and synthesizing the instance pair.
(1) Determining the sampling sample: within each epoch, all samples in the detection data are traversed in random order, and the sampling sample is determined at random.
(2) Sampling the adjacent sample group: for the adjacent sample groups of the positive and negative instance pairs, the initial samples are set to the sampling sample and to a randomly chosen other sample, respectively. To make the sampling of adjacent sample groups more efficient, random walk with restart (RWR) is used as the local sampling strategy.
(3) Hiding the sampling sample: to prevent the contrastive learning model from trivially recognizing the presence of the sampling sample inside the adjacent sample group, the attribute features of the initial sample are set to zero, i.e., the attribute information of the sampling sample is hidden.
(4) Synthesizing the instance pair: the sampling sample and the adjacent sample group are combined into an instance pair and stored in the sample pools of positive and negative instance pairs, respectively (a sampling sketch is given after this list).
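The sampling procedure can be sketched as follows. The restart probability, the default group size and the helper names are assumptions for illustration; only the overall scheme (RWR from the sampling sample for positive pairs, RWR from a random other sample for negative pairs, nodes reused to reach the fixed group size) follows the description above.

```python
import numpy as np

def rwr_sample_group(A, start, size=5, restart_p=0.15, rng=None):
    """Collect `size` node indices around `start` by random walk with restart."""
    rng = rng or np.random.default_rng()
    group, current = [], start
    while len(group) < size:
        if rng.random() < restart_p:
            current = start                       # restart at the initial sample
        else:
            neighbors = np.flatnonzero(A[current])
            current = start if len(neighbors) == 0 else int(rng.choice(neighbors))
        group.append(current)                     # nodes may repeat to reach the set size
    return group

def sample_instance_pairs(A, i, rng=None):
    """Return (positive_group, negative_group) for sampling sample i."""
    rng = rng or np.random.default_rng()
    pos = rwr_sample_group(A, i, rng=rng)                        # initial sample = the sample itself
    others = np.array([j for j in range(A.shape[0]) if j != i])
    neg = rwr_sample_group(A, int(rng.choice(others)), rng=rng)  # initial sample = a random other sample
    return pos, neg
```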
③ GCN-based contrastive learning
Graph Neural Networks (GNNs) capture the complex dependencies between data through information propagation between nodes, which has greatly improved the performance of downstream tasks such as traffic flow prediction, recommender systems, text classification and action recognition. The GCN is a multilayer graph convolutional neural network that performs a first-order local approximation of spectral graph convolution; it retains the ability of CNNs to process spatial features effectively while solving the problem that CNNs cannot keep translation invariance on discrete non-Euclidean data. In the CSGNN framework proposed by the invention, the GCN is chosen as the backbone of the GNN module, which is an important component of the framework. The sampled instance pairs are used to train the GCN-based contrastive learning model, which operates on each batch of instance pairs I_i = (s_i, G_i, y_i), where s_i denotes the sampling sample in the instance pair, G_i denotes the adjacent sample group in the instance pair, and y_i denotes the label of the instance pair. The GCN-based contrastive learning model mainly comprises a GCN module, a dimension reduction module and an embedding comparison identification module.
(1) GCN module: this module efficiently mines the features of the sampling sample s_i and the adjacent sample group G_i and maps the embeddings of the two parts into the same embedding space, which prepares for the subsequent comparison of their features. The layer-by-layer propagation rule of the GCN used for the adjacent sample group is shown in formula (4):

H^{(l+1)} = φ( D̃_i^{-1/2} Ã_i D̃_i^{-1/2} H^{(l)} W^{(l)} )      (4)

Here, H^{(l)} is the representation matrix learned by the l-th hidden layer; the input of the first hidden layer is the attribute vector matrix X_i of the adjacent sample group, and the output of the last layer is recorded as the embedding E_i of the adjacent sample group G_i. Ã_i = A_i + I is the adjacency matrix of the subgraph with self-connections added, where I is the identity matrix, and D̃_i is the degree matrix of Ã_i. W^{(l)} is the trainable weight matrix of the l-th layer, and φ(·) represents an activation function such as ReLU.
Compared with the adjacent sample group G_i, the sampling sample s_i has no structural information. Therefore, only the weight matrix of the GCN and the corresponding activation function are needed to complete the feature transformation of its attribute information, see formula (5):

h_i^{(l+1)} = φ( h_i^{(l)} W^{(l)} )      (5)

where h_i^{(l)} is the row-vector feature of the sampling sample s_i learned by the l-th hidden layer, W^{(l)} is the weight matrix shared with the GCN, the input h_i^{(0)} is defined as the attribute vector of the sample s_i, and the output is the embedding h_i of the sampling sample s_i.
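A compact sketch of formulas (4) and (5) is given below in NumPy; the symmetric normalization D̃^{-1/2} Ã D̃^{-1/2} is the standard GCN form, and the function names and shapes are illustrative.

```python
import numpy as np

def normalize_adjacency(A):
    A_tilde = A + np.eye(A.shape[0])                 # add self-connections
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    return d_inv_sqrt @ A_tilde @ d_inv_sqrt         # D^{-1/2} (A + I) D^{-1/2}

def gcn_layer(A_hat, H, W, act=lambda z: np.maximum(z, 0.0)):
    return act(A_hat @ H @ W)                        # formula (4): one propagation layer

def sample_branch(x_i, W, act=lambda z: np.maximum(z, 0.0)):
    return act(x_i @ W)                              # formula (5): shared weights only, no structure
```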
(2) Dimension reduction module: this module maps the high-dimensional sample embeddings E_i of the adjacent sample group G_i into a low-dimensional embedding space so that they can be compared with the low-dimensional embedding of the sampling sample. The principle is shown in formula (6):

e_i = (1/n_i) Σ_{m=1}^{n_i} (E_i)_m      (6)

where e_i is the low-dimensional embedding of the adjacent sample group, (E_i)_m is the m-th row of the adjacent sample group embedding E_i, and n_i is the number of samples in the adjacent sample group G_i.
(3) Embedding comparison identification module: this module completes the embedding comparison between the sampling sample and the adjacent sample group and is a key part of the GNN-based contrastive learning model. Inspired by document [48], a simple bilinear scoring function is applied in this module, see formula (7):

p_i = σ( e_i W^{(b)} h_i^T )      (7)

where W^{(b)} is the weight matrix of the comparison identification module and σ(·) represents the sigmoid function.
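Formulas (6) and (7) correspond to an average readout followed by a bilinear discriminator; a small sketch under these assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def readout(E_i):
    # formula (6): average the group embedding rows -> one low-dimensional vector
    return E_i.mean(axis=0)

def bilinear_score(e_i, h_i, W_b):
    # formula (7): sigmoid( e_i · W_b · h_i^T ), a score in (0, 1)
    return sigmoid(e_i @ W_b @ h_i)
```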
(4) Loss function
The standard binary cross entropy (BCE) loss, which has already been applied in contrastive self-supervised learning tasks, is adopted. Unlike BCE variants designed for class balancing, here the class imbalance problem of the food detection task is handled by balanced sampling between positive and negative instance pairs, so the common BCE is sufficient. For each batch of instance pairs with total size M, the following operation is performed, see formula (8):

L = -(1/M) Σ_{i=1}^{M} [ y_i log( CLM(s_i, G_i) ) + (1 - y_i) log( 1 - CLM(s_i, G_i) ) ]      (8)

Here, CLM(·) represents the contrastive learning model, and y_i is the label of the instance pair (1 for a positive pair and 0 for a negative pair).
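A sketch of this objective, assuming scores already produced by the model and pair labels y in {0, 1}, with an epsilon guard added for numerical stability:

```python
import numpy as np

def bce_loss(scores, labels, eps=1e-8):
    """scores: predicted pair scores in (0, 1); labels: 1 for positive pairs, 0 for negative pairs."""
    scores = np.clip(scores, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(scores) + (1 - labels) * np.log(1 - scores)))
```

Because each batch carries the same number of positive and negative pairs, no re-weighting of the loss is needed.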
④ Sample risk assessment
After the GNN-based contrastive learning is completed, this embodiment uses a classifier to identify the consistency between a sampling sample s_i and its adjacent sample group G_i. Ideally, the lower the risk of a sample, the closer the prediction scores of its positive and negative instance pairs are to the middle value (0.5); the higher the risk, the closer the prediction scores of its positive and negative instance pairs are to 0 and 1, respectively. This embodiment therefore defines the risk value of a sample as the absolute value of the difference between the prediction scores of its positive and negative instance pairs. Considering that the selection of the adjacent sample group G_i is incomplete and contingent, the detection samples are sampled in multiple rounds.
Specifically, each sample in the detection data is sampled; the positive and negative instance pairs are sampled with the strategy introduced in (II); the sampled instance pairs I_i are fed into the contrastive learning model; and their prediction scores p_i are calculated according to formula (9):

p_i = CLM(s_i, G_i)      (9)

Finally, the risk value f(s_i) of a sample is calculated as the average, over multiple sampling rounds, of the absolute value of the difference between the prediction scores of its positive and negative instance pairs, see formula (10):

f(s_i) = (1/R) Σ_{r=1}^{R} | p_i^+(r) - p_i^-(r) |      (10)

where R is the number of sampling rounds, p_i^+(r) and p_i^-(r) are the prediction scores of the positive and negative instance pairs of sample s_i in round r, and f(·) is the mapping function from detection data to risk values, which is the final objective of the CSGNN framework.
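The scoring loop of formulas (9) and (10) can be sketched as below. `score_pair` stands in for the trained contrastive learning model CLM(·), and `sample_instance_pairs` is the helper from the sampling sketch above; both names are ours.

```python
import numpy as np

def risk_value(sample_id, A, X, score_pair, R=256, rng=None):
    """Average |p+ - p-| over R rounds of freshly sampled positive/negative pairs."""
    rng = rng or np.random.default_rng()
    diffs = []
    for _ in range(R):
        pos_group, neg_group = sample_instance_pairs(A, sample_id, rng=rng)
        p_pos = score_pair(sample_id, pos_group, X)   # prediction score of the positive pair
        p_neg = score_pair(sample_id, neg_group, X)   # prediction score of the negative pair
        diffs.append(abs(p_pos - p_neg))
    return float(np.mean(diffs))
```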
(III) Evaluation indexes
In completing the food safety risk assessment task, the CSGNN framework aims to detect the unqualified samples and to grade the risk of the qualified samples; the following five evaluation indexes are selected accordingly.
The area under the ROC curve (AUC) comprehensively considers the ability of the model to detect both qualified and unqualified samples, and can reasonably evaluate the overall performance of the CSGNN framework on a class-imbalanced dataset. Recall, also called the recall rate, reflects the probability of an unqualified sample being misdetected as a qualified sample and is used in this embodiment to evaluate the model's ability to recall unqualified samples. Precision reflects the proportion of actually unqualified samples among all samples detected as unqualified; this embodiment measures the recognition ability of the model for unqualified and qualified samples by the precision and the qualified-sample precision (precision of qualified samples), respectively. The false alarm rate (FAR) reflects the probability that a qualified sample is detected as an unqualified sample and is used to measure the model's risk early warning ability on qualified samples.
The meanings of the four basic indicators TP, FP, FN and TN in the confusion matrix are shown in Table 5. Unqualified samples in the dataset are labeled 1 and qualified samples are labeled 0. The evaluation indexes AUC, precision, qualified-sample precision, unqualified-sample recall and FAR are calculated by formulas (11) to (15).
TABLE 5: meanings of the basic indicators in the confusion matrix. TP: an unqualified sample (label 1) correctly detected as unqualified; FP: a qualified sample (label 0) incorrectly detected as unqualified; FN: an unqualified sample incorrectly detected as qualified; TN: a qualified sample correctly detected as qualified.

AUC is the area under the ROC curve computed over the risk scores of all samples      (11)

Precision = TP / (TP + FP)      (12)

Precision of qualified samples = TN / (TN + FN)      (13)

Recall of unqualified samples = TP / (TP + FN)      (14)

FAR = FP / (FP + TN)      (15)
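A sketch of the confusion-matrix based indexes of formulas (12) to (15), with unqualified samples labeled 1 and qualified samples labeled 0 (AUC of formula (11) would instead be computed from the continuous risk values):

```python
def confusion_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)             # formula (12): precision on unqualified samples
    qualified_precision = tn / (tn + fn)   # formula (13): precision on qualified samples
    recall = tp / (tp + fn)                # formula (14): recall of unqualified samples
    far = fp / (fp + tn)                   # formula (15): false alarm rate on qualified samples
    return precision, qualified_precision, recall, far
```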
(IV) Experiments and analyses
This embodiment was fully tested; the detailed experimental comparisons and analyses are given in this section.
(1) Baseline model
Three supervised models and two unsupervised models were selected as baselines. The supervised baselines are NNLM, CNN and GCN; the unsupervised baselines are LOF and GAN.
①NNLM
NNLM is a classic shallow neural network model in the field of natural language processing; it introduced word vectors for the first time and successfully broke through the limitation of N-gram models in modeling relations between words. NNLM can learn the complex relations between words well and has played a role in detection tasks, so it is taken as the first baseline model herein. The number of hidden-layer neurons is set to 16, the learning rate to 0.00001, the batch size to 16, and the number of epochs to 30.
②CNN
CNN excels at capturing local feature information of data and is widely used in fields such as image recognition and speech recognition; recent studies have applied it to biometric recognition and food safety tasks with significant results. CNN is taken as the second baseline model herein, with two different convolution kernels (4 of each), ReLU as the activation function, Adam as the optimizer, a learning rate of 0.001, a batch size of 32, and 20 epochs.
③GCN
The GCN can simultaneously mine the attribute information and structural information in a topological graph for end-to-end learning and is currently a mainstream GNN model. The GCN has stronger feature extraction capability than the CNN, solves the problem that the CNN cannot keep translation invariance on non-Euclidean data, can effectively mine complex correlation information in data, and has promising potential in classification tasks. Therefore, GCN is taken as the third baseline model herein to explore the effect of GNN algorithms in the food detection task, with two convolutional layers, ReLU as the activation function, Adam as the optimizer, a learning rate of 0.01, and 200 epochs.
④LOF
LOF is a density-based unsupervised anomaly detection algorithm that determines whether a data point is anomalous by comparing the local neighborhood densities of each data point and its neighbors [52]. Inspired by anomaly detection in attributed networks, a similar principle holds for the food detection task: both aim to mine abnormal data whose feature information differs from that of the majority. Therefore, LOF is taken as the fourth baseline model herein to explore the effectiveness of this type of anomaly detection algorithm in the food detection task.
⑤GAN
The GAN consists of a generator (G) and a discriminator (D); it is a generative model based on unsupervised learning and is widely applied to image generation, style transfer and the like. It has also shown good performance in detection and classification tasks, so GAN is taken as the fifth baseline model herein. Adam is chosen as the optimizer for G and D, ReLU as the activation function, the learning rate is 0.0001, the batch size is 32, and the number of epochs is 500.
(2) Parameter setting
In the structured representation stage of the data, the network G is composed in the same way as for the GCN model in the baselines. In the contrastive instance pair sampling stage, the size of the adjacent sample group G_i in an instance pair I_i is fixed to 5; for adjacent sample groups smaller than this fixed size, nodes are reused so that the group reaches the set size. In the GCN-based contrastive learning stage, the number of module layers is set to 1 and the embedding dimension is fixed to 6. This embodiment selects Adam as the optimizer, with a learning rate of 0.006, a batch size of 450, and 1000 epochs. The number of sampling rounds R is set to 256, and the framework is run 10 times with the results averaged, in order to evaluate the overall performance of CSGNN in the food detection task while avoiding accidental results.
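For reference, the hyper-parameter settings listed above can be collected into one illustrative configuration object (the field names are ours; the values are those reported in this embodiment):

```python
from dataclasses import dataclass

@dataclass
class CSGNNConfig:
    z_nearest: int = 50         # edges to the Z closest samples when composing the graph
    group_size: int = 5         # fixed size of the adjacent sample group
    gcn_layers: int = 1         # number of GCN module layers
    embed_dim: int = 6          # embedding dimension
    learning_rate: float = 0.006
    batch_size: int = 450
    epochs: int = 1000
    sampling_rounds: int = 256  # R, rounds used when computing risk values
    runs: int = 10              # repetitions averaged in the evaluation
```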
(3) Analysis of results
The advantages of the CSGNN framework over the baseline models, and the performance these advantages yield in practical application scenarios, are as follows:
Comparative experiments of the five baseline models and the CSGNN model were performed on the sterilized dairy product detection data. As shown in Table 6, the CSGNN model performs better than all baseline models overall. Specifically, we have the following findings:
Table 6: all models were randomly initialized 10 times and the results averaged (X in Input represents the input data information and Y represents the corresponding labels; the best value of each evaluation index among the supervised models is shown in bold, and the best value among the unsupervised models is shown in bold and underlined).
1. The AUC values of GCN, LOF and CSGNN are all above 0.91. The AUC of GCN, 0.9988, is the highest of all models, indicating that the GNN algorithm performs very stably in the food detection task. Among the unsupervised models, the LOF model performs best with an AUC of 0.9150, probably because the goal of anomaly detection is to find a small number of anomalies within the majority of the data, which is aimed precisely at the class imbalance problem. The CSGNN value of 0.9140 is only 0.001 lower than that of LOF, suggesting that the GCN-based contrastive learning module in the CSGNN framework performs stably when dealing with class-imbalanced data, which is critical in practical application scenarios.
2. The precision values of all models are above 0.95; however, the precision values of GCN and CSGNN are the highest among the supervised and unsupervised models, respectively. Among the supervised models, the precision of GCN, 0.9979, is 0.0146 higher than that of the suboptimal model CNN; among the unsupervised models, the precision of CSGNN, 0.9829, is 0.0042 higher than that of the suboptimal model LOF. In addition, the qualified-sample precision of both GCN and CSGNN reaches 1.0000, 0.0096 higher than the suboptimal model NNLM among all models. The GNN algorithm can thus better distinguish qualified and unqualified samples by mining the complex correlation information in the detection data and has promising potential in food detection tasks.
3. The unqualified-sample recall of both GCN and CSGNN is 1.0000, an improvement of more than 13% over the suboptimal model LOF among all models. The unqualified-sample recall reflects the ability of a model to find the unqualified samples in the detection data, which is an important part of evaluating food safety risk. For the other baseline models, the reason they fail to detect all unqualified samples may be a bottleneck in mining the complex feature information between detection samples. The two GNN models detect all unqualified samples accurately, showing that the GNN algorithm successfully captures the attribute information and topological information of the detection data in the food detection task and has good application potential. Furthermore, the CSGNN framework provides a solution for applying GNN algorithms in an unsupervised setting.
4. The goal of the food detection task is to mine both the unqualified samples and the qualified samples with potential risk in the detection data. To successfully find qualified samples with potential risk, the FAR value in the food detection task should not be too low, because an extremely low FAR does not help to grade the risk of the qualified samples. Among the FAR values of the supervised models, NNLM is more suitable for the risk grading task, while GCN and CNN show values that are respectively too low and too high, which is not conducive to reasonable risk grading. Among the unsupervised models, LOF and GAN show FAR values that are respectively too low and too high and are likewise unsuitable for risk grading. CSGNN realizes the risk grading of the detection data well, because the CSGNN framework effectively overcomes the overly low value of this index in GNNs by setting a hyper-parameter; the specific scheme is detailed in the following part.
How the CSGNN framework realizes risk early warning in food safety detection applications is as follows:
Specifically, the CSGNN framework is divided into three stages: preprocessing and composition, training, and risk assessment. The preprocessing and composition stage realizes the unitless processing and structured representation of the detection data; in the training stage, the contrastive learning model completes the training on instance pairs in a self-supervised manner; and in the risk assessment stage, the risk value of each sample is obtained and the binary classification and risk grading of the detection data are completed according to the risk values.
The risk value of each sample is defined as the absolute value of the difference between the prediction scores of its positive and negative instance pairs. During the experiments, the lowest risk value among the unqualified samples in the dataset is recorded as W; to a certain extent, W reflects the boundary above which a sample is very likely to present a risk, which is why W is set as the threshold when performing the binary classification of the detection data into qualified and unqualified samples. Ideally, for a preprocessed risk sample, the greater the risk, the more the prediction scores of its positive and negative instance pairs are distributed towards 0 and 1; for a safety sample, the prediction scores of both the positive and negative instance pairs are close to the middle value (0.5). Accordingly, a risk value of U = 0.5 can be taken by default as a more conservative boundary value distinguishing risk samples from safety samples. According to the risk value, a detection sample is assigned to one of four risk levels: safe, low risk, medium risk and high risk, on the following basis (see also the sketch after the list):
safe: indicating a level of security, the sample is substantially free of risk. 0 < risk value < U.
Low risk: indicating a low risk rating, the sample is less likely to be at risk. And U is less than or equal to the qualified sample with the risk value of less than W.
Medium risk indicates a risk level, and the sample has a high probability of presenting a risk. W is less than or equal to the risk value of 1.
High risk indicates a High risk rating to which all failing samples belong.
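A sketch of this grading rule, taking the lowest unqualified-sample risk value W and the boundary value U (defaulting to 0.5) as inputs:

```python
def risk_level(risk, is_qualified, W, U=0.5):
    if not is_qualified:
        return "high risk"        # all unqualified samples
    if risk < U:
        return "safe"             # 0 < risk value < U
    if risk < W:
        return "low risk"         # U <= risk value < W
    return "medium risk"          # W <= risk value <= 1
```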
In order to more intuitively observe the risk pre-warning effect of the CSGNN framework, a visual display is performed, as shown in fig. 7. And according to the risk value distribution of the sample, the sample is divided into a safe sample, a low risk sample, a medium risk sample and a high risk sample in sequence.
A feasible explanation of the risk grading produced by the CSGNN framework in food safety detection is as follows:
it can be seen from fig. 7 that samples of the same risk level are present in clusters in the graph, with sample separation distances being extremely small. This is because the difference of the risk values of the same risk level sample is small, and the difference of the risk values of different risk level samples is obvious, and there is no sample that does not conform to the rule, which indicates that the rule of dividing the samples with different risk levels is feasible. Specifically, for 2158 sterilized milk test data, 1734 of the 2117 qualified samples had risk values below 0.17, and therefore they were classified by the CSGNN framework into safe sample classes. The lowest risk value W of the 41 rejected samples was 0.674, all of which were classified into high risk sample grades. The risk values for 19 qualified samples were above 0.17 and below W, which were classified into low risk sample classes. The risk values for 364 qualifying samples were between W and 1, which were classified as being in the middle risk sample class. We observed the risk values for the three failed samples A, B and C that were closest to the W value, with the risk values for the A, B and C samples being 0.6744, 0.6782, and 0.6789, respectively.
The three samples were further examined for 6 specific values of evaluation index, wherein sample a failed due to the fat content of 3.57 and below the minimum value of the standard range of 3.7, sample B failed due to the solid content of nonfat milk of 4.69 and below the minimum value of the standard range of 8.5, and sample C failed due to the acidity content of 10.90 and outside the standard range of 11 to 16.
Furthermore, the CSGNNs show a large difference in the risk values of the safety and risk samples in the data set (there is a cliff-type gap between the risk values of the safety and other samples), and the boundary value for distinguishing the safety and risk samples is far lower than the default U value, even 0.17. The frame obviously highlights the information difference between the safety sample and the risk sample in the preprocessing process, the accidental judgment caused by local perception is smoothly avoided in the comparison and learning process of the multi-sampling example pair, the infinite approach of the prediction score of the safety sample corresponding example to the middle value (0.5) is realized in the process of calculating the risk value, and the prediction score of the risk sample corresponding example is distributed to the two sides of 0 and 1. In conclusion, the CSGNN framework finally has obvious hierarchical division on the risk values of the sterilized dairy product detection samples, and reasonable division on food safety risk levels is efficiently realized.
For unknown food detection data, the CSGNN framework first carries out data preprocessing and optimal composition, and the data are then fed forward through the trained network to obtain a risk value for each detection sample. In this process, the invention focuses on the lowest risk value W of the unqualified samples and on the more obvious boundary value U between risk samples and safety samples (considering that the quality of different detection datasets is uneven, if no obvious U value is observed, U defaults to 0.5). The framework then realizes the risk grading of different food detection data according to the W and U values obtained from each dataset, following the principle described in the risk early warning part above.
(4) Framework application and optimization
Considering the rigor of government work, the method can also involve an expert panel from the food quality supervision department in the example analysis and application, to observe the risk early warning results generated by the framework and to intervene manually when necessary. In this way, the CSGNN framework improves the working efficiency of the expert panel, while the expert panel guarantees the rationality and stability of the risk assessment results. Meanwhile, the improvement suggestions of the expert panel on the framework are collected in this process, so as to further improve the robustness and generalization ability of the CSGNN framework and to promote food quality and safety supervision.
This embodiment proposes applying GNN-based contrastive self-supervised learning to the early warning and control of food safety risk, which is the first attempt of GNN-based self-supervised learning in food safety warning analysis. The invention innovatively provides an end-to-end food safety risk assessment framework, CSGNN, composed of four parts: data preprocessing, contrastive instance pair sampling, GCN-based contrastive learning and sample risk assessment. The CSGNN framework was applied to a large amount of sterilized milk detection data from Guizhou Province, China; the experimental results show that it successfully mines the attribute information and structural information among the different indexes of the food detection data, obtains the risk value of every instance in a contrastive self-supervised manner, and realizes the risk assessment. Sufficient experiments show that the framework detects all unqualified samples, has good stability and a low false detection rate in practical applications with class-imbalanced data, and achieves satisfactory AUC and unqualified-sample recall values. The invention provides a new idea for food safety risk assessment methods, and the self-supervised learning mode greatly reduces the complexity and time cost of the work. Likewise, food safety supervision departments can make more efficient decisions based on the detection results of CSGNN together with an expert panel.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any way; any simple modification, equivalent change or variation made to the above embodiment in accordance with the technical spirit of the present invention falls within the scope of the technical solution of the present invention.

Claims (9)

1. A food safety risk early warning method, comprising the following steps:
(1) normalizing the detection data of the detection samples, and then composing a structured graph representation according to the relevance among the detection samples;
(2) sampling the sample data to obtain example pairs;
(3) establishing a GCN-based contrastive learning model and performing contrastive learning, then identifying the consistency of the example pairs and carrying out sample risk assessment.
2. The warning method according to claim 1,
in the step (1), the detection data are normalized by the following formulas according to whether each index is classified as a forward index, a reverse index or an oscillatory index:
[The normalization formulas for the forward, reverse and oscillatory indices, together with the definitions of the quantities appearing in them, are given as equation images (FDA0003696610620000011 to FDA0003696610620000015) in the original publication.]
3. The warning method according to claim 1,
the sampling process in the step (2) sequentially comprises: determining the sampled sample pairs, sampling the neighboring sample groups, hiding the sampled samples, and synthesizing the instances.
4. The warning method according to claim 1,
in the step (3), a classifier is adopted to identify the consistency of the example pairs; the closer the prediction scores of the positive and negative example pairs are to the middle value, the lower the risk, and conversely, the higher the risk.
5. The warning method according to claim 4,
in the step (3), after contrastive learning is performed, the prediction score is calculated by the following formula:

$$\hat{p}_i = \mathrm{CLM}\!\left(s_i,\ \mathcal{N}_{s_i}\right)$$

wherein $\mathrm{CLM}(\cdot)$ represents the contrastive learning model, $s_i$ is the sampled sample, and $\mathcal{N}_{s_i}$ is its adjacent sample group;
and finally, sample risk assessment is carried out, wherein the risk value of a sample is obtained by averaging, over multiple sampling rounds, the absolute value of the difference between the prediction scores of the positive and negative example pairs, according to the following formula:

$$f(s_i) = \frac{1}{R}\sum_{r=1}^{R}\left|\hat{p}_i^{\,(r,+)} - \hat{p}_i^{\,(r,-)}\right|$$

wherein $f(s_i)$ is the risk value of the sample, $R$ is the number of sampling rounds, and $\hat{p}_i^{\,(r,+)}$ and $\hat{p}_i^{\,(r,-)}$ are the prediction scores of the positive and negative example pairs in the $r$-th round.
6. The warning method according to claim 1,
in the step (3), the contrastive learning model comprises a GCN module, a dimensionality reduction module and an embedding comparison and identification module.
7. The warning method according to claim 6,
the dimensionality reduction module adopts the following formula:

$$e_i = \frac{1}{n_i}\sum_{m=1}^{n_i}\left(E_i\right)_m$$

wherein $e_i$ is the low-dimensional embedded feature, $(E_i)_m$ is the $m$-th row of the adjacent-sample-group embedding $E_i$, and $n_i$ is the number of samples in the adjacent sample group $\mathcal{N}_{s_i}$.
8. The warning method according to claim 7,
the embedding comparison and identification module adopts the following formula:

$$\hat{p}_i = \sigma\!\left(e_i\, W^{(b)}\, z_i^{\top}\right)$$

wherein $\hat{p}_i$ is the prediction score of the example pair, $e_i$ is the low-dimensional embedded feature of the adjacent sample group, $z_i$ is the embedding of the sampled sample $s_i$, $W^{(b)}$ refers to the weight matrix of the comparison and identification module, and $\sigma(\cdot)$ represents the sigmoid function.
9. The warning method according to claim 1,
in the step (3), a binary cross-entropy loss function is further adopted in the contrastive learning model to address the class imbalance problem in the food detection task.
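The formulas referenced in claims 5 to 8 and the loss of claim 9 can be illustrated with the following minimal PyTorch sketch. It is written under explicit assumptions (a single-layer GCN encoder, mean-pooling dimensionality reduction, a bilinear sigmoid discriminator and a standard binary cross-entropy loss), and all class, function and variable names are hypothetical; it is a sketch of the general technique, not a definitive implementation of the patented framework.

```python
import torch
import torch.nn as nn

class ContrastiveGCNSketch(nn.Module):
    """Illustrative GCN module, mean-pooling dimensionality reduction module
    and bilinear comparison/identification module (names are hypothetical)."""

    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.gcn_lin = nn.Linear(in_dim, hid_dim, bias=False)   # GCN layer weights
        self.w_b = nn.Parameter(torch.empty(hid_dim, hid_dim))  # W^(b) of the discriminator
        nn.init.xavier_uniform_(self.w_b)

    def gcn(self, adj: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # Simplified GCN propagation: row-normalized adjacency x features x weights
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return torch.relu((adj / deg) @ self.gcn_lin(x))

    def forward(self, adj: torch.Tensor, x: torch.Tensor, target_idx: int) -> torch.Tensor:
        """adj, x: adjacency and attributes of one sampled neighboring group; the
        attribute row of the sampled sample is assumed to have been masked already
        (the "hiding" sub-step of claim 3). Returns the example-pair prediction score."""
        h = self.gcn(adj, x)                        # embeddings E_i of the neighboring group
        e_i = h.mean(dim=0)                         # mean-pooling readout over the n_i rows
        z_i = h[target_idx]                         # embedding of the sampled sample s_i
        return torch.sigmoid(e_i @ self.w_b @ z_i)  # bilinear discriminator with sigmoid

def risk_value(model: ContrastiveGCNSketch, pos_pairs, neg_pairs) -> float:
    """Risk value of one sample: average over R sampling rounds of the absolute
    difference between positive-pair and negative-pair prediction scores."""
    diffs = [abs(model(*p).item() - model(*n).item()) for p, n in zip(pos_pairs, neg_pairs)]
    return sum(diffs) / len(diffs)

# Training-step sketch: positive pairs are labelled 1, negative pairs 0, and the
# scores are optimized with a binary cross-entropy loss, e.g.
#   scores = torch.stack([model(adj, x, t) for (adj, x, t) in batch_pairs])
#   loss = nn.BCELoss()(scores, pair_labels.float())
```

In this sketch a positive example pair couples a sample with its own neighboring group and a negative pair couples it with the neighboring group of another sample; that pairing convention is an assumption of the sketch, while the reading of the scores follows claim 4, with risk taken from how far the two prediction scores drift apart.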
CN202210676050.6A 2022-06-15 2022-06-15 Early warning method for food safety risk Pending CN115062969A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210676050.6A CN115062969A (en) 2022-06-15 2022-06-15 Early warning method for food safety risk

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210676050.6A CN115062969A (en) 2022-06-15 2022-06-15 Early warning method for food safety risk

Publications (1)

Publication Number Publication Date
CN115062969A (en) 2022-09-16

Family

ID=83200662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210676050.6A Pending CN115062969A (en) 2022-06-15 2022-06-15 Early warning method for food safety risk

Country Status (1)

Country Link
CN (1) CN115062969A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115545124A (en) * 2022-11-29 2022-12-30 支付宝(杭州)信息技术有限公司 Sample increment and model training method and device under sample unbalance scene
CN115545124B (en) * 2022-11-29 2023-04-18 支付宝(杭州)信息技术有限公司 Sample increment and model training method and device under sample unbalance scene

Similar Documents

Publication Publication Date Title
Tang et al. Convolutional neural network‐based data anomaly detection method using multiple information for structural health monitoring
Shen et al. Rough sets, their extensions and applications
Ajitha et al. Identification of glaucoma from fundus images using deep learning techniques
CN112859822A (en) Equipment health analysis and fault diagnosis method and system based on artificial intelligence
KR102387887B1 (en) Apparatus for refining clean labeled data for artificial intelligence training
CN111047173B (en) Community credibility evaluation method based on improved D-S evidence theory
CN116153495A (en) Prognosis survival prediction method for immunotherapy of esophageal cancer patient
Carbonera et al. A novel density-based approach for instance selection
de Campos Souza et al. EFNN-NullUni: An evolving fuzzy neural network based on null-uninorm
CN111860775A (en) Ship fault real-time diagnosis method based on CNN and RNN fusion
CN112183652A (en) Edge end bias detection method under federated machine learning environment
Zhang et al. Feature selection and resampling in class imbalance learning: Which comes first? An empirical study in the biological domain
CN115062969A (en) Early warning method for food safety risk
Ribeiro et al. Does dataset complexity matters for model explainers?
Nikanjam et al. Design smells in deep learning programs: an empirical study
Mvoulana et al. Fine-tuning Convolutional Neural Networks: a comprehensive guide and benchmark analysis for Glaucoma Screening
CN116633639B (en) Network intrusion detection method based on unsupervised and supervised fusion reinforcement learning
Rahman et al. Data cleaning in knowledge discovery database-data mining (KDD-DM)
CN115392404B (en) Outlier detection model training method, outlier detection method and outlier detection device
Ribeiro et al. Multi-objective support vector machines ensemble generation for water quality monitoring
Watthaisong et al. Comparative Evaluation of Imbalanced Data Management Techniques for Solving Classification Problems on Imbalanced Datasets
Almas et al. Enhancing the performance of decision tree: A research study of dealing with unbalanced data
Pristyanto et al. Comparison of ensemble models as solutions for imbalanced class classification of datasets
Assefa et al. Software Risk Prediction at Requirement and Design Phase: An Ensemble Machine Learning Approach
Nurmalasari et al. Retinal Fundus Images Classification to Diagnose the Severity of Diabetic Retinopathy using CNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination