CN117765391A

CN117765391A - Method, system, equipment and medium for predicting total ginsenoside content based on machine learning

Info

Publication number: CN117765391A
Application number: CN202311759588.4A
Authority: CN
Inventors: 张巍; 白雪媛; 赵大庆
Original assignee: Changchun University of Chinese Medicine
Current assignee: Changchun University of Chinese Medicine
Priority date: 2023-12-20
Filing date: 2023-12-20
Publication date: 2024-03-26

Abstract

The invention discloses a method, a system, equipment and a medium for predicting total ginsenoside content based on machine learning. The method for predicting the total ginsenoside content comprises the following steps: receiving a hyperspectral image of a ginseng sample to be tested; cropping the hyperspectral image, mosaicking the cropped images to form a main body file, performing reflectance correction on the hyperspectral data of the main body file, and extracting regions of interest (ROIs) from the reflectance-corrected hyperspectral data based on the R value, thereby generating a plurality of main body file ROIs; performing principal component analysis on the hyperspectral data of the main body file ROIs, selecting the first 10 principal band intervals according to single-band imaging saturation, selecting modeling bands, and taking the ratio of the R values of the two bands with the highest correlation among the selected modeling bands; and inputting the calculated R value ratio into a total ginsenoside content prediction model to predict the total ginsenoside content of the ginseng sample to be tested.

Description

Method, system, equipment and medium for predicting total ginsenoside content based on machine learning

Technical Field

The invention belongs to the technical field of image processing based on machine learning, and particularly relates to a method, a system, equipment and a medium for predicting total ginsenoside content based on machine learning.

Background

The current methods for predicting total ginsenoside content mainly use near-infrared diffuse reflectance and near-infrared transmission spectroscopy, with chemometrics for modeling and prediction. A near-infrared diffuse reflectance spectrum is collected from a solid sample of raw ginseng medicinal material, and a near-infrared transmission spectrum from a liquid sample of ginseng extract; multiplicative scatter correction and smoothing are applied to the spectra; the real total saponin content of the ginseng sample is measured by high performance liquid chromatography or colorimetry; and finally a calibration (correction) model and a prediction model of the near-infrared spectrum are established by partial least squares.

Although these near-infrared methods offer in-situ and rapid detection, the spectrum acquired by either diffuse reflectance or transmission sampling is only that of a single sampling point ("point acquisition"). Even if spectra are acquired repeatedly, they carry little information and cannot effectively and comprehensively reflect the overall spectral characteristics of the whole ginseng. Repeated sampling is also strongly operator-dependent, time-consuming and laborious, so the reliability of a prediction model built from such a small amount of spectral information remains open to question.

Therefore, it is necessary to establish a prediction model of total ginsenoside content based on the overall characteristics of spectral information of the whole ginseng, and to more accurately predict the total ginsenoside content based on the prediction model.

Disclosure of Invention

The invention provides a method, a system, equipment and a medium for predicting total ginsenoside content based on machine learning, which are used for solving the problem that the existing prediction model in the prior art cannot reflect the overall spectral characteristics of the whole ginseng, so that the real total saponin content cannot be accurately reflected.

In a first aspect, the present invention provides a method for constructing a model for predicting total ginsenoside content of ginseng based on machine learning (hereinafter referred to as "the method for constructing the present invention"), comprising the steps of:

1) Obtaining hyperspectral images of a plurality of ginseng samples in the 900-1700 nm band, together with real data of the total ginsenoside content of the ginseng samples;

2) Cropping the hyperspectral images of the ginseng samples, mosaicking the cropped images to form a main body file, and performing reflectance correction on the hyperspectral data of the main body file, so that the pixel brightness DN value of the original image (hereinafter referred to as the DN value) is converted into the relative reflectance R value of the image (hereinafter referred to as the R value);

3) Extracting regions of interest (hereinafter referred to as ROIs) from the reflectance-corrected hyperspectral data of the main body file based on the R value, and generating a plurality of main body file regions of interest;

4) Taking the plurality of main body file regions of interest as sample data, randomly dividing the sample data into two parts in a ratio of 3:1, and denoting them as a modeling data set and a verification data set, respectively;

5) Performing Principal Component Analysis (PCA) operation processing on the hyperspectral data of the region of interest of the main body file, using covariance matrix calculation, extracting spectral information of a main characteristic wave band reflecting the main body spectral information of the ginseng, and eliminating useless spectral information;

6) Selecting, according to single-band imaging saturation, the first 10 principal band intervals obtained by the principal component analysis; selecting modeling bands from them; taking the ratio of the R values of the two modeling bands with the highest correlation as the independent variable and the real data of the total ginsenoside content as the dependent variable; and constructing a univariate linear regression model on the randomly partitioned modeling data set as the total ginsenoside content prediction model.
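The regression in step 6) can be illustrated with a short sketch. It assumes that the R value ratio and the measured content are available as one-dimensional arrays; the function names are hypothetical and the sketch is not the implementation of the invention.

```python
import numpy as np

def fit_univariate_linear(ratio, content):
    """Least-squares fit of content = slope * ratio + intercept.

    ratio   : R value ratio of the two selected bands, one value per sample
    content : measured total ginsenoside content, one value per sample
    (Both names are illustrative assumptions.)
    """
    ratio = np.asarray(ratio, dtype=float)
    content = np.asarray(content, dtype=float)
    # polyfit with deg=1 returns the coefficients highest power first.
    slope, intercept = np.polyfit(ratio, content, deg=1)
    predicted = slope * ratio + intercept
    ss_res = np.sum((content - predicted) ** 2)
    ss_tot = np.sum((content - content.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared

def predict_content(ratio, slope, intercept):
    """Apply the fitted line to new R value ratios."""
    return slope * np.asarray(ratio, dtype=float) + intercept
```

The fit would be run on the modeling data set, and `predict_content` would then be applied to the ratios of the verification data set to compare estimated and measured values.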

In the invention, the hyperspectral image of the 900-1700 nm wave band of the ginseng sample can be obtained by scanning the ginseng sample by a hyperspectral imager and collecting the data of the 900-1700 nm wave band.

In one embodiment of the construction method of the present invention, generating the main body file in step 2) may include: cropping the hyperspectral image of each ginseng along the edge of its shape, generating a mask file for the area outside the cropped region, and performing pixel-based mosaic merging on the cropped images to form the main body file.

In one embodiment, the reflectance correction may be performed by the following formula:

$$R = \frac{I_R - I_B}{I_W - I_B}$$

where R represents the relative reflectance of the corrected image, I_R represents the DN value of the original image, I_W represents the DN value of the whiteboard image, and I_B represents the DN value of the dark reference image obtained by covering the lens with an opaque cover.
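A minimal sketch of this correction, assuming the usual flat-field form R = (I_R - I_B) / (I_W - I_B) and that the raw, white and dark data are arrays of the same shape. The function name and the `eps` guard are illustrative assumptions.

```python
import numpy as np

def reflectance_correct(raw_dn, white_dn, dark_dn, eps=1e-12):
    """Convert raw DN values to relative reflectance R.

    raw_dn   : DN values of the original image (I_R)
    white_dn : DN values of the whiteboard image (I_W)
    dark_dn  : DN values of the dark reference image (I_B)
    """
    raw = np.asarray(raw_dn, dtype=float)
    white = np.asarray(white_dn, dtype=float)
    dark = np.asarray(dark_dn, dtype=float)
    denom = white - dark
    # Where the white and dark references coincide the ratio is undefined;
    # mark those pixels as NaN instead of dividing by zero.
    denom = np.where(np.abs(denom) < eps, np.nan, denom)
    return (raw - dark) / denom
```

For instance, a pixel with a raw DN of 300, a white reference of 500 and a dark reference of 100 gives R = 200 / 400 = 0.5.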

In one embodiment, in step 3), the region of interest (ROI) may be extracted by drawing the region of interest within the edge of the ginseng shape using an arbitrary-polygon ROI extraction tool.

In another embodiment of the construction method of the present invention, the independent variable may be selected as follows: ratios are formed from the R values of the selected modeling bands, and the absolute value of the correlation coefficient between the total ginsenoside content and each band ratio is calculated; the ratio of the R values of the two bands with the largest absolute correlation coefficient is selected as the independent variable.

In another embodiment, selecting the modeling bands in step 6) may include the following steps: selecting the first 10 principal band intervals extracted by PCA according to single-band imaging saturation; performing supervised classification on the main body file regions of interest using an SVM algorithm to obtain classification results for different features of the ginseng; performing statistical analysis and confusion-matrix accuracy evaluation on the SVM classification results; using the standard deviation of the characteristic bands in the statistical analysis results as the index, comparing the standard deviation values of the candidate bands and selecting the first 8 as the main modeling bands; and then taking the ratio of the R values of the two main modeling bands with the highest correlation as the independent variable.

In the construction method of the invention, the ratio of R values refers to the ratio, at any pixel, of the R values of the 8 different band images obtained after PCA processing. "Highest correlation" means that the correlation coefficient has the largest value, such as the b2/b1 band ratio in FIG. 6. The correlation coefficient can be calculated with the CORREL function in Excel, and its absolute value with the ABS function in Excel.
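The same selection can be sketched outside a spreadsheet. The example below assumes a table of mean R values with one row per sample and one column per modeling band, and uses `numpy.corrcoef` (Pearson correlation, the quantity CORREL returns). The function name and the zero-based band indices are illustrative assumptions.

```python
import numpy as np
from itertools import permutations

def best_band_ratio(band_r, content):
    """Return (i, j, |r|) for the band ratio b_i / b_j whose correlation
    with the measured content has the largest absolute value.

    band_r  : array of shape (n_samples, n_bands) with mean R values
    content : array of shape (n_samples,) with measured content
    """
    band_r = np.asarray(band_r, dtype=float)
    content = np.asarray(content, dtype=float)
    best = None
    # Ordered pairs, because b_i / b_j and b_j / b_i are different predictors.
    for i, j in permutations(range(band_r.shape[1]), 2):
        ratio = band_r[:, i] / band_r[:, j]
        r = np.corrcoef(ratio, content)[0, 1]
        if best is None or abs(r) > best[2]:
            best = (i, j, abs(r))
    return best
```

With 8 modeling bands this evaluates 56 ordered band pairs, which is small enough to search exhaustively.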

In one embodiment, performing the supervised classification on the main body file regions of interest using an SVM algorithm may include the following steps:

S100, extracting, from each main body file region-of-interest sample in the modeling data set, the average R value of the ginseng sample in the main characteristic bands as the classification information of that region-of-interest sample;

S110, dividing all main body file region-of-interest samples in the modeling data set into different categories, wherein region-of-interest samples from the same ginseng sample are assigned to the same category;

S120, constructing a support vector machine for classifying the samples according to the category and the classification information of each main body file region-of-interest sample in the modeling data set;

S130, determining the predicted category of each main body file region-of-interest sample in the verification data set, determining a classification confusion matrix from the actual and predicted categories of those samples, and determining the classification accuracy of the current support vector machine for each category from the confusion matrix;

S140, calculating the overall accuracy of the current support vector machine, defined as the number of main body file region-of-interest samples in the verification data set that the support vector machine predicts correctly divided by the total number of main body file region-of-interest samples in the verification data set;

S150, judging whether the overall accuracy is greater than a preset accuracy threshold; if so, proceeding to S170, and if not, executing S160;

S160, updating the categories of the main body file region-of-interest samples in the modeling data set and returning to S110: a new support vector machine is constructed from the updated categories and its overall accuracy is checked, and the process is repeated until a support vector machine with an overall accuracy greater than the preset accuracy threshold is obtained, after which S170 is executed;

S170, performing statistical analysis on the classification results for different features of the ginseng obtained by the support vector machine determined in S150 or S160, and selecting modeling bands using the standard deviation of the characteristic bands in the statistical analysis results as the index.
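The accuracy check in S130 to S150 can be sketched as follows. The sketch takes the actual and predicted categories as given (the support vector machine itself is not shown) and assumes categories are encoded as integers starting at 0. All function names are hypothetical.

```python
import numpy as np

def confusion_matrix(actual, predicted, n_classes):
    """Rows are actual categories, columns are predicted categories."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        cm[a, p] += 1
    return cm

def per_class_accuracy(cm):
    """Fraction of each actual category that was predicted correctly (S130)."""
    totals = cm.sum(axis=1)
    return np.divide(np.diag(cm), totals,
                     out=np.zeros(len(totals)), where=totals > 0)

def overall_accuracy(cm):
    """Correct predictions divided by all verification samples (S140)."""
    return np.trace(cm) / cm.sum()

def passes_threshold(cm, threshold):
    """Decision of S150: True means proceed to S170, False means S160."""
    return overall_accuracy(cm) > threshold
```

When `passes_threshold` returns False, the categories would be updated and the classifier rebuilt, as described in S160.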

In a second aspect, the present invention provides a method for predicting total ginsenoside content, comprising the steps of: receiving a hyperspectral image of the 900-1700 nm band of a ginseng sample to be tested; cropping the hyperspectral image of the ginseng sample to be tested, mosaicking the cropped images to form a main body file, performing reflectance correction on the hyperspectral data of the main body file to convert the DN values of the original image into relative reflectance R values, and extracting regions of interest from the reflectance-corrected hyperspectral data of the main body file based on the R values to generate a plurality of main body file regions of interest; performing principal component analysis on the hyperspectral data of the main body file regions of interest, using covariance matrix calculation to extract the spectral information of the main characteristic bands reflecting the main spectral information of the ginseng, selecting the first 10 principal band intervals obtained by the principal component analysis according to single-band imaging saturation, selecting modeling bands, and taking the ratio of the R values of the two bands with the highest correlation among the selected modeling bands; and inputting the calculated R value ratio into a total ginsenoside content prediction model constructed by the construction method of the invention to predict the total ginsenoside content of the ginseng sample to be tested.

In a third aspect, the present invention provides a system for predicting total ginsenoside content, comprising: a data receiving module configured to receive a hyperspectral image of the 900-1700 nm band of a ginseng sample to be tested; a feature extraction module configured to crop the hyperspectral image of the ginseng sample to be tested, mosaic the cropped images to form a main body file, perform reflectance correction on the hyperspectral data of the main body file to convert the DN values of the original image into relative reflectance R values, and extract regions of interest from the reflectance-corrected hyperspectral data of the main body file based on the R values to generate a plurality of main body file regions of interest; a calculation module configured to perform principal component analysis on the hyperspectral data of the main body file regions of interest, use covariance matrix calculation to extract the spectral information of the main characteristic bands reflecting the main spectral information of the ginseng, select the first 10 principal band intervals obtained by the principal component analysis according to single-band imaging saturation, select modeling bands, and take the ratio of the R values of the two bands with the highest correlation among the selected modeling bands; and a content prediction module configured to input the R value ratio calculated by the calculation module into a total ginsenoside content prediction model constructed by the construction method of the invention to predict the total ginsenoside content of the ginseng sample to be tested.

In a fourth aspect, the present invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method for predicting total ginsenoside content of the present invention when executing the program.

In a fifth aspect, the present invention provides a non-transitory computer readable storage medium having stored thereon a computer program which when executed by a processor implements the ginseng total saponin content prediction method of the present invention.

In the construction method of the present invention, the ginseng samples used for constructing the total ginsenoside content prediction model may come from different producing areas, varieties and ages. The producing areas may include common domestic producing areas, such as the ginseng, American ginseng and Korean ginseng planting areas in Jilin Province, Liaoning Province, Heilongjiang Province, Shandong Province, etc. Varieties may include, but are not limited to, garden ginseng, American ginseng, Korean ginseng, forest-grown mountain ginseng, wild ginseng, and processed products thereof such as red ginseng and black ginseng. The age of the ginseng sample may be 3 years to 80 years. The number of ginseng samples is generally not less than 56.

In one embodiment, during spectral scanning the ginseng samples are placed so that the features of each sample are clearly visible, and the samples are neither packed tightly nor overlapped. An example of the spectral scanning conditions is as follows: the distance between the lens of the hyperspectral imager and the ginseng is 25-35 cm; the platform moving speed is 3 mm/s; the integration time is 3 ms; and the frame rate is 20 frames/second. Each ginseng sample may be scanned 3 times.
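As a rough consistency check on these scanning conditions, and assuming that the imager records one spatial line per frame while the platform moves (a push-broom arrangement, which the text does not state explicitly), the distance travelled by the platform per frame is:

$$\Delta y = \frac{v}{f} = \frac{3\ \text{mm/s}}{20\ \text{frames/s}} = 0.15\ \text{mm per frame}$$

Here $v$ is the platform moving speed and $f$ is the frame rate; $\Delta y$ is an illustrative symbol that does not appear in the original.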

In the construction method of the invention, in the selection of modeling wave bands, the basis of selecting the first 10 wave band intervals is as follows: noise can be identified from the single-band image processed by PCA, the first 10 band intervals can effectively eliminate noise interference, and the main spectrum information of the sample characteristics is covered.

Compared with the prior art, the invention has the following beneficial technical effects:

A method that predicts total ginsenoside content using the prediction model constructed by the construction method of the invention is not only in-situ, rapid, accurate and non-destructive, but also offers "surface acquisition", which effectively makes up for the shortcomings of "point acquisition": all spectral and image information of the surface of the raw ginseng medicinal material can be obtained in a single scan, so the prediction result is more comprehensive, specific and vivid.

Drawings

FIG. 1 shows the reflectance spectrum of ginseng after correction (900-1700 nm).

FIG. 2 shows the hyperspectral image of the main body file consisting of 56 ginseng samples.

FIG. 3 shows the pseudo-color image synthesized from the main bands after PCA processing of the main spectral data of ginseng.

FIG. 4 shows the SVM classification map of different features of the ginseng surface.

FIG. 5 shows the standard deviation of the characteristic bands for each SVM class.

FIG. 6 shows the correlation coefficients between the ratios of the PCA main band R values and the measured values.

FIG. 7 shows the total ginsenoside content correction model.

FIG. 8 shows the measured and estimated values of total ginsenoside content.

FIG. 9 shows the total ginsenoside content correction model (random partition model 1).

FIG. 10 shows the measured and estimated values of total ginsenoside content (random partition verification 1).

FIG. 11 shows the total ginsenoside content correction model (random partition model 2).

FIG. 12 shows the measured and estimated values of total ginsenoside content (random partition verification 2).

FIG. 13 shows the total ginsenoside content correction model (random partition model 3).

FIG. 14 shows the measured and estimated values of total ginsenoside content (random partition verification 3).

FIG. 15 shows the average reflectance spectra (400-1000 nm) of the 168 ROI samples of ginseng.

Detailed Description

The invention will be further illustrated with reference to the following specific examples, but the invention is not limited to the following examples. The methods are conventional methods unless otherwise specified. Raw materials are available from published commercial sources unless otherwise specified.

The ginseng used in the following examples was 5-year-old farmland-cultivated ginseng produced in Jingyu County, Baishan City, Jilin Province, China.

The hyperspectral imager used in the examples described below is an IMPERX series hyperspectral imaging camera.

Example 1 Construction of the machine-learning-based total ginsenoside content prediction model of the present invention

The detailed experimental procedure and data results of constructing the machine-learning-based total ginsenoside content prediction model of the present invention are described below by way of a specific example:

1. Hyperspectral scanning is performed on all 56 ginseng samples, and reflectance correction is performed on the raw hyperspectral data to eliminate noise interference from the environment, the instrument and the like and to convert the DN values into uniform relative reflectance values. The steps are as follows:

(1) Spectral scanning was performed on all 56 samples of 5-year-old farmland-cultivated ginseng purchased from Jingyu County, Baishan City, Jilin Province, and hyperspectral data in the 900-1700 nm band were collected. The hyperspectral images of the ginseng samples were cropped, that is, the hyperspectral image of each ginseng was cut out along the edge of its shape (a mask file was generated for the area outside the cropped region), and pixel-based mosaic merging was performed on the cropped images to generate the hyperspectral image of a main body file consisting of the 56 ginseng samples; FIG. 2 shows part of this hyperspectral image.

(2) The distance between the lens of the hyperspectral imager and the ginseng sample may be 25-35 cm; this distance is determined by the focal length of the lens, and the samples should as far as possible stay within the field of view of the lens during each scan. The ginseng samples are placed so that the features of each sample are clearly visible, and are neither overlapped nor packed tightly. The platform moving speed may be 3 mm/s; the integration time may be 3 ms (the integration time determines the number of photons collected per exposure, and a longer integration time gives higher image quality); the frame rate may be 20 frames/second (the frame rate is the number of image frames per second); and each ginseng sample is scanned 3 times. Dark and white reference plates must be scanned before the samples for reflectance correction. The reflectance correction formula is as follows:

$$R = \frac{I_R - I_B}{I_W - I_B}$$

where R represents the relative reflectance of the corrected image, I_R represents the DN value of the original image (the DN value is the brightness value of a remote sensing image pixel), I_W represents the DN value of the whiteboard image, and I_B represents the DN value of the dark reference image obtained by covering the lens with an opaque cover.

The collected hyperspectral data were imported into ENVI software, and the raw image data were converted into relative reflectance data using the reflectance correction formula; part of the corrected reflectance spectra of ginseng is shown in FIG. 1;

(3) Extracting regions of interest (ROIs) from the reflectance-corrected hyperspectral data of the main body file;

Through step (3), 168 ROI samples are obtained from the hyperspectral data of the main body file. Each ROI sample contains the hyperspectral image data from one spectral scan of one ginseng sample; since this embodiment performs 3 spectral scans on each of the 56 ginseng samples, 168 ROI samples are obtained.

One ROI sample may specifically include the relative reflectance, at different positions of the scanned surface of the ginseng sample, for light of different bands in one spectral scan. For example, the ROI sample obtained from the first spectral scan of ginseng sample No. 2 among the 56 ginseng samples includes the relative reflectance for light of different bands at each position of ginseng sample No. 2 during that scan.

(4) 168 ROI samples selected from the hyperspectral image of the main body file are divided into two parts according to the proportion of 3:1, and the two parts are respectively recorded as a modeling data set and a verification data set.

It should be noted that the segmented verification data set may satisfy the following conditions:

For each ROI sample in the verification data set, the modeling data set contains at least one ROI sample from the same ginseng sample.

Illustratively, if the verification data set includes the ROI sample obtained from the second spectral scan of ginseng sample No. 41, the modeling data set may include the two ROI samples obtained from the first and third spectral scans of ginseng sample No. 41.
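One simple way to obtain a 3:1 split that satisfies this condition is to hold out a single scan from each of 42 randomly chosen ginseng samples. The sketch below illustrates that strategy; it is an assumption made for illustration, stricter than the stated condition requires, and the function name and seed are hypothetical.

```python
import random

def split_roi_samples(n_ginseng=56, n_scans=3, n_validation=42, seed=0):
    """Split ROI samples, identified as (ginseng number, scan number),
    into a modeling set and a verification set."""
    rng = random.Random(seed)
    all_rois = [(g, s)
                for g in range(1, n_ginseng + 1)
                for s in range(1, n_scans + 1)]
    # Choose distinct ginseng samples and hold out one scan from each, so the
    # remaining scans of every held-out sample stay in the modeling set.
    held_out = rng.sample(range(1, n_ginseng + 1), n_validation)
    validation = [(g, rng.randint(1, n_scans)) for g in held_out]
    validation_set = set(validation)
    modeling = [roi for roi in all_rois if roi not in validation_set]
    return modeling, validation
```

With 56 samples and 3 scans each, this leaves 126 ROI samples for modeling and 42 for verification, that is, a 3:1 ratio.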

2. PCA (principal component analysis) is performed on the hyperspectral image data of the ginseng main body file (namely, the main body file hyperspectral image generated in step (1)), and the main characteristic bands of the main spectral information of the ginseng are extracted using covariance matrix calculation.

For each ginseng sample, the relative reflectivity of the ginseng sample to the light of the main characteristic wave band can be obtained from the ROI sample corresponding to the ginseng sample, and a pseudo-color image can be generated based on the obtained relative reflectivity, for example, fig. 3 is a pseudo-color image of a part of the ginseng sample.

Extracting the main characteristic bands eliminates a large amount of useless spectral information and thereby improves the efficiency of the subsequent classification. Spectral information here means the relative reflectance of the ginseng sample for light of different bands.

The manner of determining the dominant characteristic bands by principal component analysis may include:

First, for the i-th ROI sample (i ranges from 1 to 168), the first feature vector Pi of the ROI sample is determined as (x1i, x2i, x3i, …, xmi), where k denotes the k-th band used in the spectral scan, k ranges from 1 to m, m is the total number of bands used in the spectral scan, and xki is the average relative reflectance of the i-th ROI sample in the k-th band, which may be calculated by averaging the relative reflectance for light of the k-th band over all positions of the ginseng sample in the i-th ROI sample.

Then, all the first feature vectors Pi are de-centered (mean-centered) according to formula (1) to obtain the de-centered feature vectors Xi.

Next, the covariance matrix of all the decentered feature vectors is calculated according to equation (2).

Where X represents a matrix of all the decentered feature vectors Xi, each of which serves as the ith row of the matrix.
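The expressions for formulas (1) and (2) do not appear in the text above. A standard form consistent with the surrounding definitions, assuming n = 168 ROI samples and a 1/n normalization (an assumption; 1/(n-1) is equally common), would be:

$$X_i = P_i - \frac{1}{n}\sum_{j=1}^{n} P_j \tag{1}$$

$$C = \frac{1}{n} X^{T} X \tag{2}$$

With each $X_i$ forming the i-th row of $X$, the matrix $C$ is of size $m \times m$, one row and column per band. The symbols $n$ and $C$ are introduced here for illustration only.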

After the covariance matrix is obtained, its eigenvalues are computed with a matrix eigenvalue algorithm. Among these eigenvalues, the S largest are selected in descending order, the eigenvectors corresponding to these S eigenvalues are calculated, and the S eigenvectors form the eigenvector matrix W.

Finally, performing dimension reduction processing on each first feature vector Pi according to a formula (3) to obtain a dimension reduction feature vector Zi.

$$Z_i = W^{T} P_i \tag{3}$$
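The de-centering, covariance, eigen-decomposition and projection steps can be sketched together as follows. The sketch assumes the first feature vectors are stacked as rows of a matrix and that the covariance uses a 1/n normalization, which the text does not specify. The function name is hypothetical.

```python
import numpy as np

def pca_reduce(P, S):
    """P: array of shape (n_samples, m_bands) whose i-th row is P_i.
    Returns the eigenvector matrix W (m_bands x S) and the reduced vectors Z
    (n_samples x S), whose i-th row is W^T P_i as in formula (3)."""
    P = np.asarray(P, dtype=float)
    X = P - P.mean(axis=0)                 # de-centered feature vectors
    C = X.T @ X / X.shape[0]               # covariance matrix (1/n assumed)
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1][:S]  # indices of the S largest eigenvalues
    W = eigvecs[:, order]
    Z = P @ W                              # formula (3) applied to every row
    return W, Z
```

Note that formula (3) as written projects the first feature vector Pi rather than the de-centered Xi, and the sketch follows that choice.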

After the dimension-reduced feature vectors of all ROI samples are obtained, any S bands are selected from the m bands used in the spectral scan, and the elements of the first feature vector corresponding to these S bands form a second feature vector (x1i, x2i, …, xSi), where the index s of an element xsi ranges from 1 to S. The cosine similarity between the second feature vector and the dimension-reduced feature vector of the same ROI sample is calculated, and the cosine similarities of all ROI samples are averaged to obtain the reference value of the selected S bands. This process is repeated for different combinations of S bands to obtain the reference value of each combination, and the S bands with the largest reference value are determined as the main characteristic bands.

For example, assuming that S is 3, the first 3 of the m bands may be taken as one combination and its reference value calculated, then the 4th, 5th and 6th bands as another combination, then the 7th, 8th and 9th bands, and so on. If the reference value of the combination of the first 3 bands turns out to be the largest, the 1st, 2nd and 3rd of the m bands used in the spectral scan are determined as the main characteristic bands.
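The band-combination search in this example can be sketched as follows, assuming consecutive non-overlapping groups of S bands as in the example and non-zero feature vectors. Band indices are zero-based here, and the function name is hypothetical.

```python
import numpy as np

def select_main_bands(P, Z):
    """P: (n_samples, m_bands) first feature vectors, one per row.
    Z: (n_samples, S) dimension-reduced feature vectors, one per row.
    Returns the group of S consecutive bands with the largest reference
    value (mean cosine similarity), together with that reference value."""
    P = np.asarray(P, dtype=float)
    Z = np.asarray(Z, dtype=float)
    S = Z.shape[1]
    m = P.shape[1]
    best_bands, best_ref = None, -np.inf
    for start in range(0, m - S + 1, S):
        bands = list(range(start, start + S))
        V = P[:, bands]                    # second feature vectors
        cos = np.sum(V * Z, axis=1) / (
            np.linalg.norm(V, axis=1) * np.linalg.norm(Z, axis=1))
        ref = cos.mean()                   # reference value of this group
        if ref > best_ref:
            best_bands, best_ref = bands, ref
    return best_bands, best_ref
```

Other ways of enumerating combinations are possible, since the text allows any S bands to be chosen; consecutive groups are used only because they match the worked example.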

The number of principal characteristic bands determined by principal component analysis may be 3 or other numbers, and is not limited.

In the principal component analysis described above, the number of selected eigenvalues may equal the number of main characteristic bands to be determined; that is, if 3 main characteristic bands are to be determined, the 3 largest eigenvalues are selected.

Fig. 3 is generated from the first 3 bands after PCA processing, i.e., band 1-band 3 intervals.

3. Modeling of the total ginsenoside content prediction model:

(1) The real total ginsenoside content of the 56 ginseng samples is measured by colorimetry; each ginseng sample is measured 3 times in parallel, giving 168 measured content values in total;

(2) Following steps 1 and 2 above, the spectral information of the main characteristic bands of the main spectral information of the ginseng extracted by PCA is obtained. The first 10 principal band intervals extracted by PCA are selected according to single-band imaging saturation. The basis for selecting these 10 bands is that noise can be identified in the PCA-processed single-band images, and the first 10 band intervals effectively exclude noise interference while covering the main spectral information of the sample features. Supervised classification is then performed on the ROI regions using an SVM (support vector machine) algorithm, with an RBF (radial basis function) kernel, the kernel function parameter L set to 0.1 and the penalty coefficient set to 100, to obtain classification results for different features of the ginseng (FIG. 4). Statistical analysis and confusion-matrix accuracy evaluation are performed on the SVM classification results (Table 1 and Table 2), and the modeling bands are optimized using the standard deviation of the characteristic bands in the statistical analysis results as the index (FIG. 5). By comparing the standard deviation values of the bands in FIG. 5, the first 8 bands are finally selected as the main modeling bands; ratios are formed from the R values of these 8 bands, and the absolute value of the correlation coefficient between the total ginsenoside content and each band ratio is calculated (FIG. 6).

The main modeling bands in step (2) are determined as follows.

a1, from all bands used for spectral scanning, selecting the 10 bands with the highest imaging saturation as alternative bands.

a2, for each ROI sample in the modeling data set, extracting the average relative reflectivity of the ginseng sample in each main characteristic band as the classification information of that ROI sample.

a3, dividing all ROI samples in the modeling data set into 4 different categories, denoted category 1 to category 4 in sequence. In this step, the ROI samples may be divided equally among the 4 categories or allocated randomly according to a certain probability, without limitation.

ROI samples derived from the same ginseng sample should be assigned to the same category.
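The initial assignment in step a3 can be illustrated with a minimal sketch. It assumes the equal-split option and assumes each ROI sample is identified by a hypothetical pair (ginseng sample index, scan index); the function name, the identifiers and the seed are illustrative and do not come from the original text.

```python
import random

def assign_initial_categories(roi_to_sample, n_categories=4, seed=0):
    """Assign every ROI sample a category 1..n_categories.

    roi_to_sample maps a (hypothetical) ROI sample id to the id of the
    ginseng sample it came from. ROI samples from the same ginseng sample
    always receive the same category, as step a3 requires.
    """
    rng = random.Random(seed)
    ginseng_ids = sorted(set(roi_to_sample.values()))
    rng.shuffle(ginseng_ids)
    # Deal the shuffled ginseng samples round-robin so the categories
    # end up with equal numbers of ginseng samples.
    category_of_ginseng = {
        g: (i % n_categories) + 1 for i, g in enumerate(ginseng_ids)
    }
    return {roi: category_of_ginseng[g] for roi, g in roi_to_sample.items()}

# 56 ginseng samples x 3 scans = 168 ROI samples (ids are illustrative).
roi_to_sample = {(g, u): g for g in range(56) for u in range(1, 4)}
categories = assign_initial_categories(roi_to_sample)
```

Grouping by ginseng sample before assigning categories is what guarantees the same-category constraint; a per-ROI random assignment would not.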

a4, constructing a support vector machine for classifying the ROI samples according to the category to which each ROI sample belongs in the modeling data set and the classification information of each ROI sample.

The process of constructing the support vector machine is as follows:

b1, determining the classification sample (xi, yi, zi) corresponding to the i-th ROI sample, where xi is a vector whose elements correspond one-to-one to the main characteristic bands, the value of each element being the average relative reflectivity of the ginseng sample in the i-th ROI sample in the corresponding band, and yi and zi together represent the category of the ROI sample. In total, 126 classification samples (three quarters of 168) are obtained.

Here yi may be -1 or 1 and zi may be -1 or 1, so there are four different combinations of yi and zi, corresponding to the four categories. For example, yi = -1, zi = -1 represents the first category; yi = -1, zi = 1 the second; yi = 1, zi = -1 the third; and yi = 1, zi = 1 the fourth.
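The two-label encoding can be written as a small lookup table. This is only a sketch of the example mapping given above; the names CATEGORY_TO_YZ, encode and decode are illustrative.

```python
# (y, z) sign pairs for categories 1 to 4, following the example mapping
# given in the text.
CATEGORY_TO_YZ = {1: (-1, -1), 2: (-1, 1), 3: (1, -1), 4: (1, 1)}
YZ_TO_CATEGORY = {yz: c for c, yz in CATEGORY_TO_YZ.items()}

def encode(category):
    """Return the (y, z) labels used to train the two binary classifiers."""
    return CATEGORY_TO_YZ[category]

def decode(y, z):
    """Return the category implied by the outputs of the two classifiers."""
    return YZ_TO_CATEGORY[(y, z)]
```

With this encoding, two binary decision functions are enough to separate four categories.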

b2, a penalty factor C greater than 0 is set, e.g. C is set equal to 100.

b3, determining an objective function shown in a formula (4) and a constraint condition shown in a formula (5) according to each classified sample and the penalty coefficient.

The vector Alpha has as many elements as there are classification samples, namely 126.

Each element of vector Alpha is greater than or equal to 0 and less than or equal to penalty coefficient C.

b4, using any algorithm from the related art for solving the linear programming problem, calculating the vector Alpha that minimizes the objective function subject to the constraint condition.

b5, determining a first decision coefficient b1 and a first decision vector w1 according to formulas (6) and (7).

Where xk and yk represent the vector and the category of an arbitrarily selected k-th classification sample among the classification samples.

b6, determining a first classification decision function for determining the y value of any classification sample according to the first decision vector and the first decision coefficient and formulas (8) and (9).

Here sign denotes the sign function: when the argument in the brackets is positive the function outputs 1, and when the argument is negative it outputs -1. x represents the vector formed by the average relative reflectivities of the main characteristic bands in any ROI sample to be classified. ‖x - xi‖ represents the norm of the vector xi - x, w1i represents the i-th component of the first decision vector w1, and L is a preset kernel function parameter, which may be set to 0.1, for example.
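Formulas (8) and (9) are not reproduced in this text, so the following is only a sketch of how such a decision function might be evaluated. It assumes the standard RBF kernel exp(-L‖x - xi‖²) and a decision score of the form Σ w1i·K(x, xi) + b1; both the form and the function names are assumptions, not the patent's formulas.

```python
import math

def rbf_kernel(x, xi, L=0.1):
    """Standard RBF kernel (assumed form): exp(-L * ||x - xi||^2)."""
    squared_norm = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-L * squared_norm)

def sign(value):
    # The text defines sign only for positive and negative arguments;
    # mapping exactly zero to 1 is an arbitrary choice made here.
    return 1 if value >= 0 else -1

def decide(x, train_vectors, w, b, L=0.1):
    """Evaluate sign(sum_i w[i] * K(x, x_i) + b) for one ROI vector x."""
    score = sum(wi * rbf_kernel(x, xi, L) for wi, xi in zip(w, train_vectors)) + b
    return sign(score)

# Toy example with two training vectors and hand-picked coefficients.
train_vectors = [[0.0, 0.0], [4.0, 4.0]]
w = [-1.0, 1.0]
print(decide([0.1, 0.1], train_vectors, w, b=0.0))  # close to the first vector
print(decide([3.9, 3.9], train_vectors, w, b=0.0))  # close to the second vector
```

The same routine, called with the second decision vector and coefficient, would give the z value.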

b7, substituting yi in the process of b3 to b6 with zi, substituting yk with zk, and obtaining a second classification decision function for determining the z value of any classification sample according to the same process, as shown in formulas (10) and (11).

Where Beta represents the second support vector, determined in the same way as the first support vector Alpha; w2 represents the second decision vector, determined in the same way as the first decision vector; and b2 represents the second decision coefficient, determined in the same way as the first decision coefficient.

The first classification decision function and the second classification decision function together form the support vector machine to be constructed in a4.

a5, for each ROI sample in the verification data set, forming the vector x of the ROI sample from its average relative reflectivity in each main characteristic band, inputting x into the support vector machine constructed in a4 to obtain the y value and z value of the ROI sample, and thereby determining the predicted category of the ROI sample.

a6, determining the actual category of each ROI sample in the verification data set.

For each ROI sample of the verification dataset, the actual class may be determined as follows.

Find another ROI sample that originates from the same ginseng sample as this ROI sample of the verification data set and that belongs to the modeling data set; the category used in step a4 for that other ROI sample is taken as the actual category of the ROI sample in the verification data set.

a7, determining a classification confusion matrix according to the actual category and the predicted category of each ROI sample in the verification data set (table 1), and determining the classification precision of the current support vector machine in different categories according to the classification confusion matrix (table 2).

TABLE 1

Category	Prediction category 1	Prediction category 2	Prediction category 3	Prediction category 4	Total
Actual category 1	12	0	0	0	12
Actual category 2	0	8	3	1	12
Actual category 3	0	2	6	0	8
Actual category 4	0	0	1	9	10
Total	12	10	10	10	42

TABLE 2

In table 1, each cell gives the number of ROI samples in the verification data set whose predicted category corresponds to the column and whose actual category corresponds to the row. For example, the cell at the intersection of prediction category 2 and actual category 3 has the value 2, which indicates that 2 ROI samples in the verification data set are predicted as category 2 but actually belong to category 3.

In table 1, only ROI samples belonging to cells on the diagonal, that is, cells whose corresponding prediction category and actual category agree with each other, are ROI samples that are correctly predicted by the support vector machine, and ROI samples belonging to other cells are samples that are incorrectly predicted.

In table 2, the accuracy of each category equals the number of correctly predicted ROI samples of that category in the verification data set divided by the total number of ROI samples predicted as that category, i.e., the value of the diagonal cell divided by the total of its column. The sensitivity of each category equals the number of correctly predicted ROI samples of that category divided by the total number of ROI samples whose actual category is that category, i.e., the value of the diagonal cell divided by the total of its row.
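The two per-category measures can be computed directly from the counts in table 1. The sketch below applies the definitions above to those counts; its results are derived here from table 1 and are not copied from table 2, which is not reproduced in this text. The function name is illustrative.

```python
# Rows are actual categories 1 to 4, columns are predicted categories 1 to 4
# (counts taken from table 1).
confusion = [
    [12, 0, 0, 0],
    [0, 8, 3, 1],
    [0, 2, 6, 0],
    [0, 0, 1, 9],
]

def per_class_metrics(matrix):
    """Return (accuracy, sensitivity) lists, one entry per category."""
    n = len(matrix)
    col_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    row_totals = [sum(row) for row in matrix]
    # Accuracy as defined in the text: diagonal cell / column total.
    accuracy = [matrix[k][k] / col_totals[k] for k in range(n)]
    # Sensitivity as defined in the text: diagonal cell / row total.
    sensitivity = [matrix[k][k] / row_totals[k] for k in range(n)]
    return accuracy, sensitivity

accuracy, sensitivity = per_class_metrics(confusion)
for k in range(4):
    print(f"category {k + 1}: accuracy {accuracy[k]:.2%}, sensitivity {sensitivity[k]:.2%}")
```

The per-category accuracy defined here is what is often called precision, and the sensitivity is often called recall.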

a8, calculating the total accuracy of the support vector machine.

The total accuracy may be defined as the number of ROI samples in the verification data set that the support vector machine predicts correctly, divided by the number of all ROI samples in the verification data set.
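As a worked illustration of this definition, the total accuracy for the counts in table 1 is the sum of the diagonal divided by the sum of all cells. This is a sketch only; the function name is illustrative.

```python
def total_accuracy(matrix):
    """Correctly predicted ROI samples divided by all ROI samples."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Counts from table 1: rows are actual categories, columns predicted ones.
table_1 = [
    [12, 0, 0, 0],
    [0, 8, 3, 1],
    [0, 2, 6, 0],
    [0, 0, 1, 9],
]
print(f"{total_accuracy(table_1):.4f}")  # 35 correct out of 42
```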

a9, judging whether the total accuracy is larger than a preset accuracy threshold.

If the total accuracy is greater than the accuracy threshold, step a11 is performed; otherwise, step a10 is performed.

a10, updating the category to which the ROI sample in the modeling data set belongs.

For example, a portion of the ROI samples in the modeling data set that were originally assigned to category 1 may be changed to category 2, and a portion of the ROI samples that were originally assigned to category 2 may be changed to category 3. Which categories lose ROI samples, which categories gain them, and how many samples are moved may all be determined randomly.

Alternatively, this step may take into account the accuracy of the different categories shown in table 2. Specifically, if a category has higher accuracy (for example, the highest accuracy of all categories), the ROI samples it contains are changed as little as possible; if a category has lower accuracy, the ROI samples it contains are changed as much as possible, for example by newly allocating a plurality of ROI samples to the category or removing a plurality of ROI samples from it.

After step a10 is executed, returning to step a4, constructing a new support vector machine according to the category of each ROI sample in the updated modeling data set, continuously checking the total accuracy of the new support vector machine, and repeating the process until a support vector machine with the total accuracy being greater than an accuracy threshold is obtained.
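The loop formed by steps a4 to a10 can be sketched as follows. The three callables stand in for the SVM construction, the verification and the category update described above; their names, and the max_rounds safeguard, are assumptions added for illustration and are not part of the original method.

```python
def refine_categories(labels, build_svm, total_accuracy, update_labels,
                      threshold, max_rounds=100):
    """Repeat steps a4 to a10 until the total accuracy exceeds the threshold.

    build_svm(labels) -> model        (step a4)
    total_accuracy(model) -> float    (steps a5 to a8)
    update_labels(labels) -> labels   (step a10)
    """
    for _ in range(max_rounds):
        model = build_svm(labels)
        accuracy = total_accuracy(model)
        if accuracy > threshold:       # step a9
            return model, labels, accuracy
        labels = update_labels(labels)
    # The text repeats without limit; stopping after max_rounds is an
    # assumption made here to keep the sketch from looping forever.
    raise RuntimeError("accuracy threshold not reached")

# Toy stand-ins: the "model" is just a counter whose accuracy grows by 0.1.
model, labels, accuracy = refine_categories(
    labels=0,
    build_svm=lambda lab: lab,
    total_accuracy=lambda m: m / 10,
    update_labels=lambda lab: lab + 1,
    threshold=0.8,
)
```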

a11, dividing the 168 ROI samples into 4 categories using the support vector machine, and calculating the standard deviation of each category on the 10 alternative bands selected in a1.

After the support vector machine is obtained, for each of 168 ROI samples, the average value of the relative reflectivity of each main characteristic band of the ROI sample may be formed into a vector x of the ROI sample, and the vector x is input into the support vector machine constructed in a4 to obtain the y value and the z value of the ROI sample, thereby determining the category of the ROI sample.

For each category, the relative reflectivity average for each ROI sample for that category over each alternative band may be calculated.

For example, for ROI sample 1 belonging to category 1, the relative reflectance values of the alternative band 1 light at each position of the ginseng sample may be obtained from ROI sample 1, then the relative reflectance values of the alternative band 1 light at each position of the ginseng sample may be averaged to obtain the average value of the relative reflectance of ROI sample 1 belonging to category 1 over alternative band 1, and similarly, the average value of the relative reflectance of ROI sample 2 belonging to category 1 over alternative band 1 may be obtained.

Then, for each category, for each alternative band, calculating the standard deviation of the average value of the relative reflectivities of all the ROI samples belonging to the category on the alternative band, and obtaining the standard deviation of the category on the alternative band.

Continuing the above example, assuming that 50 ROI samples belong to category 1, the average relative reflectivity of each of the 50 ROI samples on alternative band 1 can be calculated in a11, and then the standard deviation of these 50 averages is calculated; the result is the standard deviation of category 1 on alternative band 1.
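The per-category standard deviation in step a11 can be sketched as below. The text does not say whether the population or the sample standard deviation is meant; the population form (statistics.pstdev) is assumed here, and the reflectance values are made up for illustration.

```python
import statistics

def roi_band_mean(pixel_reflectances):
    """Average relative reflectivity of one ROI sample on one band."""
    return statistics.fmean(pixel_reflectances)

def class_band_std(roi_pixel_lists):
    """Standard deviation, over the ROI samples of one category, of their
    per-ROI average reflectivity on one alternative band.

    Population standard deviation is an assumption; the text does not
    specify which form is used.
    """
    means = [roi_band_mean(pixels) for pixels in roi_pixel_lists]
    return statistics.pstdev(means)

# Three illustrative ROI samples of one category, two pixels each.
example = [[0.2, 0.4], [0.5, 0.5], [0.6, 0.8]]
print(class_band_std(example))
```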

a12, calculating the weighted average standard deviation of each alternative band.

For each alternative band, the average standard deviation of the alternative band is equal to the weighted average of the standard deviations of each class in the alternative band.

For example, for alternative band 1, the standard deviations of category 1 to category 4 on alternative band 1 are denoted standard deviation 11 to standard deviation 14 in sequence, and the first weighting coefficients corresponding to category 1 to category 4 are denoted first weighting coefficient 1 to first weighting coefficient 4 in sequence. The weighted average standard deviation of alternative band 1 may be calculated by computing in turn the product of standard deviation 11 and first weighting coefficient 1, the product of standard deviation 12 and first weighting coefficient 2, the product of standard deviation 13 and first weighting coefficient 3, and the product of standard deviation 14 and first weighting coefficient 4, then adding the 4 products and dividing the sum by 4; the result is the weighted average standard deviation of alternative band 1.

The first weighting coefficient corresponding to any category may be equal to the proportion of ROI samples belonging to that category among all 168 ROI samples.
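The calculation in step a12 can be sketched as follows, taking the description literally: the weights are the category proportions, and the weighted sum is then divided by the number of categories. The function name and the example numbers are illustrative.

```python
def weighted_average_std(class_stds, class_counts):
    """Weighted average standard deviation of one alternative band.

    class_stds[k]   standard deviation of category k+1 on the band
    class_counts[k] number of ROI samples assigned to category k+1
    """
    total = sum(class_counts)
    weights = [count / total for count in class_counts]
    weighted_sum = sum(s * w for s, w in zip(class_stds, weights))
    # Dividing by the number of categories follows the text as written.
    return weighted_sum / len(class_stds)

# Illustrative values: four equally sized categories of 42 ROI samples.
print(weighted_average_std([0.1, 0.2, 0.3, 0.4], [42, 42, 42, 42]))
```

Because the final division by 4 is the same constant for every band, it does not change the order in which the bands are ranked in step a13.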

a13, selecting a main modeling wave band from 10 alternative wave bands according to the weighted average standard deviation of each alternative wave band.

The selection may be made by taking, in order of increasing weighted average standard deviation, N alternative bands as the main modeling bands, where N may be set to 8 or to another value.
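Step a13 amounts to sorting the alternative bands by their weighted average standard deviation and keeping the N smallest. A minimal sketch, with placeholder standard deviation values that are not taken from figure 5:

```python
def select_modeling_bands(weighted_std_by_band, n=8):
    """Return the n bands with the smallest weighted average standard
    deviation, listed in band-number order."""
    ranked = sorted(weighted_std_by_band, key=weighted_std_by_band.get)
    return sorted(ranked[:n])

# Placeholder values for 10 alternative bands (illustrative only).
weighted_std = {
    1: 0.011, 2: 0.012, 3: 0.015, 4: 0.013, 5: 0.018,
    6: 0.020, 7: 0.016, 8: 0.019, 9: 0.041, 10: 0.037,
}
print(select_modeling_bands(weighted_std))
```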

(3) The ratio of band 2 to band 1, which has the highest correlation, is selected as the independent variable and the total ginsenoside content (%) as the dependent variable, and a univariate (unitary) linear regression model (figure 7) is constructed from the modeling samples obtained by random segmentation. The regression equation is: y = -18.284x + 4.2417, R² = 0.7427;

In step (3), the reflectance ratio of each ROI sample may first be calculated for every pairwise combination of the main modeling bands.

By way of example, assuming that the primary modeling bands include bands 1 through 8, the 8 bands may be combined two by two, and then for each combination, the reflectance ratio of each ROI sample over that combination is calculated. Taking the combination of band 1 and band 2 as an example, the manner of calculating the reflectance ratio for the ROI sample over a band combination may be:

Determine the average relative reflectivity of the ROI sample in band 1 and in band 2, then divide the average of the lower-numbered band by the average of the higher-numbered band to obtain the reflectivity ratio of the ROI sample for that band combination. For example, dividing the average relative reflectivity of the ROI sample in band 1 by that in band 2 gives the reflectivity ratio of the ROI sample for the combination of band 1 and band 2.

Then, for each combination of two bands, the correlation degree P between the reflectance ratio of the combination and the total ginsenoside content is calculated using formula (12), from the reflectance ratio of each ROI sample in the combination and the total ginsenoside content (%) corresponding to each ROI sample.
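The pairwise ratios for one ROI sample can be sketched as below, dividing the lower-numbered band by the higher-numbered band as described. The function name and reflectance values are illustrative.

```python
from itertools import combinations

def pairwise_ratios(band_means):
    """band_means maps band number -> average relative reflectivity of one
    ROI sample. Returns {(low_band, high_band): ratio} for every pair,
    with the lower-numbered band in the numerator."""
    return {
        (low, high): band_means[low] / band_means[high]
        for low, high in combinations(sorted(band_means), 2)
    }

# Illustrative reflectivity averages for three bands of one ROI sample.
ratios = pairwise_ratios({1: 0.2, 2: 0.4, 3: 0.5})
print(ratios)
```

With 8 main modeling bands this yields 28 band combinations per ROI sample.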

Where Rai represents the ratio of the reflectivity of the ith ROI sample at the combination, and a (Rai) represents the average of the ratios of the reflectivity of all 168 ROI samples at the combination. Coi represents the total ginsenoside content (%) corresponding to the ROI sample, and a (Coi) represents the average value of the total ginsenoside content (%) corresponding to all 168 ROI samples.

If an ROI sample is obtained from the u-th spectral scan of a ginseng sample, its corresponding total ginsenoside content is the content value obtained from the u-th total ginsenoside content measurement of that ginseng sample, where u ranges from 1 to 3.

After the correlation degree of every pairwise band combination is obtained, the absolute values of the correlation degrees are taken and the largest is selected. The two bands in the corresponding combination are determined as the target bands, and the reflectivity ratio of the two target bands is the independent variable of the unitary linear regression model to be constructed.

For example, comparing the absolute values of the correlation degrees shows that the correlation between the reflectance ratio of band 1 and band 2 and the total ginsenoside content has the largest absolute value, so band 1 and band 2 are determined as the target bands.
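Formula (12) is not reproduced in this text. The symbols described (each ratio and content value compared with its mean over all ROI samples) are consistent with the Pearson correlation coefficient, so the sketch below assumes that form; the function names and the toy data are illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient (assumed form of the correlation P)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def best_band_pair(ratios_by_pair, contents):
    """Return the band pair whose ratio has the largest |P| with content."""
    return max(ratios_by_pair,
               key=lambda pair: abs(pearson(ratios_by_pair[pair], contents)))

# Toy data: four ROI samples, two band combinations.
contents = [1.0, 2.0, 3.0, 4.0]
ratios_by_pair = {(1, 2): [4.0, 3.0, 2.0, 1.0], (1, 3): [1.0, 3.0, 2.0, 4.0]}
print(best_band_pair(ratios_by_pair, contents))
```

Taking the absolute value means a strongly negative relationship, as in the reported regression with its negative slope, is ranked as highly as a strongly positive one.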

After the independent variable and the dependent variable are determined, a data point consisting of the independent variable value and the dependent variable value can be determined for each ROI sample. These data points are then fitted with any existing unitary linear regression algorithm to obtain a unitary linear regression model for predicting the total ginsenoside content. The fitting process can follow the related prior art and is not described further here.
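One common choice for such a fit is ordinary least squares, sketched below. The text leaves the algorithm open, so this is only one possibility. The example points are synthetic: they are generated from the regression equation reported in step (3) purely to check that the fit recovers a known line, and are not measured data.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept.
    Returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic, noise-free points lying on the reported line.
xs = [0.05, 0.10, 0.15, 0.20]
ys = [-18.284 * x + 4.2417 for x in xs]
slope, intercept, r_squared = fit_line(xs, ys)
```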

(4) To ensure reliable accuracy verification, the obtained regression equation is applied to the verification samples to calculate estimated values of the total ginsenoside content, and the accuracy of the total ginsenoside content prediction model is verified against the measured values of the verification samples using the mean absolute percentage error (MAPE) and the root mean square error (RMSE) (FIG. 8). The resulting MAPE is 10.08 and the verification-set RMSE is 0.35. For most model prediction applications, prediction accuracy is generally considered high when MAPE is less than 10, and RMSE values between 0.1 and 1 are considered acceptable, with smaller values being better. Note that because the modeling and verification data sets obtained by random segmentation differ each time, the regression model and verification accuracy obtained also differ each time.
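The two error measures can be computed as in the following sketch, which uses their usual definitions with MAPE expressed in percent. The measured and estimated values in the example are made up for illustration.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))

def rmse(actual, predicted):
    """Root mean square error, in the units of the measured values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Illustrative measured and estimated contents (%), not data from FIG. 8.
measured = [2.0, 4.0]
estimated = [1.8, 4.4]
print(mape(measured, estimated), rmse(measured, estimated))
```

MAPE is undefined when a measured value is zero, which is not a concern for content values that are strictly positive.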

168 samples are selected from the hyperspectral image of the main body file and divided into two parts at a ratio of 3:1. The random division is carried out three times, a corresponding total ginsenoside content correction model is obtained each time, and the total ginsenoside content of the verification set is predicted.
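A repeated random 3:1 division can be sketched as below. The seeds and the function name are illustrative assumptions; the text does not say how the random division is generated.

```python
import random

def split_3_to_1(sample_ids, seed):
    """Randomly split the samples into a modeling part (3/4) and a
    verification part (1/4)."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 3 // 4
    return shuffled[:cut], shuffled[cut:]

# Three independent divisions of the 168 samples (seeds are arbitrary).
splits = [split_3_to_1(range(168), seed) for seed in (1, 2, 3)]
for modeling, verification in splits:
    print(len(modeling), len(verification))
```

Each division yields 126 modeling samples and 42 verification samples, matching the counts used for the support vector machine and for table 1.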

The total ginsenoside content correction model (random segmentation model 1) is shown in figure 9; the actual measurement value and the estimated value (random segmentation verification 1) of the total saponins content of the verification set ginseng are shown in figure 10; the total ginsenoside content correction model (random segmentation model 2) is shown in figure 11; the actual measurement value and the estimated value (random segmentation verification 2) of the total saponins content of the verification set ginseng are shown in figure 12; the total ginsenoside content correction model (random segmentation model 3) is shown in figure 13; the actual measurement value and the estimated value (random segmentation verification 3) of the total saponins content of the verification set ginseng are shown in fig. 14.

From fig. 9-14, it can be seen that the regression model obtained from 3 random segmentations and the verification accuracy remained substantially unchanged or changed little.

Example 2 selection of band ranges

Using the experimental conditions and method of Example 1, spectral data in the 400-1000 nm band range were also examined. As shown in fig. 15, only one broad peak was found, located between 600 and 700 nm, which is mainly the reflection spectrum of water molecules. In contrast, the 900-1700 nm band range contains more peaks and carries more information, most of which is the reflection spectrum of total ginsenoside molecules. The 900-1700 nm band range is therefore used in the invention to predict the total ginsenoside content.

Claims

1. A ginseng total saponin content prediction system, comprising:

the data receiving module is configured to receive hyperspectral images of 900-1700 nm wave bands of the ginseng sample to be detected;

the characteristic extraction module is configured to cut the hyperspectral image of the ginseng sample to be detected, form a main body file through mosaic processing of the cut image, conduct reflectivity correction on hyperspectral data of the main body file, convert DN values of original images into relative reflectivity R values of the images, extract a main body file region of interest on the basis of the R values on the hyperspectral data of the main body file after reflectivity correction, and generate a plurality of main body file regions of interest;

the calculation module is configured to perform principal component analysis operation processing on the hyperspectral data of the region of interest of the main body file, use covariance matrix calculation to extract spectrum information of main characteristic wave bands reflecting the spectrum information of the ginseng main body, then select the first 10 main wave band intervals obtained through principal component analysis according to single-band imaging saturation, select modeling wave bands, and select R values of two wave bands with highest correlation from the selected modeling wave bands to make a ratio;

the content prediction module is configured to input the R-value ratio calculated by the calculation module into a ginseng total saponin content prediction model constructed by the method according to any one of claims 2 to 7, so as to predict the ginseng total saponin content of the ginseng sample to be measured.

2. A construction method of a ginseng total saponin content prediction model based on machine learning comprises the following steps:

2) Cutting the hyperspectral image of each ginseng sample, forming a main body file by mosaic processing of the cut images, carrying out reflectivity correction on the hyperspectral data of the main body file, and converting the pixel brightness (DN) values of the original image into relative reflectivity (R) values of the image;

3) Carrying out region-of-interest extraction on the hyperspectral data of the subject file subjected to reflectivity correction based on the R value, and generating a plurality of subject file regions-of-interest;

5) Performing principal component analysis and operation processing on the hyperspectral data of the region of interest of the main body file, using covariance matrix calculation, extracting spectral information of a main characteristic wave band reflecting the main body spectral information of the ginseng, and eliminating useless spectral information;

3. The method according to claim 2, wherein generating the main body file in step 2) comprises: cutting out the hyperspectral image of each ginseng along the edge of its shape, generating a mask file for the area outside the cut region, and performing pixel-based mosaic merging on the plurality of cut images to form the main body file.

4. The method according to claim 2, wherein in step 2), the reflectance correction is performed by the following formula:

5. The method of claim 2, wherein the independent variable is selected by: taking ratios of the R values of the selected modeling bands, and calculating the absolute value of the correlation coefficient between the total ginsenoside content and each band ratio; and selecting the ratio of the R values of the two bands with the highest absolute value of the correlation coefficient as the independent variable.

6. The method according to claim 2, characterized in that the selection of the modeling band in step 6) includes: selecting the first 10 main band intervals extracted by PCA according to single-band imaging saturation; performing a supervised classification operation on the regions of interest of the main body file using an SVM algorithm to obtain classification results for different characteristics of the ginseng; performing statistical analysis and confusion-matrix precision evaluation on the SVM classification results; selecting modeling bands using the standard deviation of the characteristic bands in the statistical analysis results as the index, the first 8 alternative bands being selected as the main modeling bands by comparing the standard deviation values of the characteristic bands; and then selecting, from the main modeling bands, the ratio of the R values of the two bands with the highest correlation as the independent variable.

7. The method of claim 6, wherein the performing a supervised classification operation on the region of interest of the subject document using an SVM algorithm comprises the steps of:

8. A method for predicting total ginsenoside content comprises the following steps:

receiving a hyperspectral image of a 900-1700 nm wave band of a ginseng sample to be detected;

cutting the hyperspectral image of the ginseng sample to be detected, forming a main body file by mosaic processing of the cut images, carrying out reflectivity correction on hyperspectral data of the main body file, converting DN values of original images into R values of relative reflectivity of the images, extracting a main body file region of interest from the hyperspectral data of the main body file subjected to reflectivity correction based on the R values, and generating a plurality of main body file regions of interest;

performing principal component analysis operation processing on the hyperspectral data of the region of interest of the main body file, using covariance matrix calculation to extract spectral information of main characteristic wave bands reflecting the spectral information of the ginseng main body, selecting the first 10 main wave band intervals obtained through principal component analysis according to single-band imaging saturation, selecting modeling wave bands, and selecting R values of two wave bands with highest correlation from the selected modeling wave bands to be used as a ratio;

inputting the ratio of the calculated R values into a ginseng total saponin content prediction model constructed by the method according to any one of claims 2 to 7 to predict the ginseng total saponin content of the ginseng sample to be measured.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for predicting total ginsenoside content of ginseng of claim 8 when executing the program.

10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the method for predicting total ginsenoside content of ginseng of claim 8.