CN112287601A

CN112287601A - Method and medium for constructing tobacco leaf quality prediction model by using R language and application

Info

Publication number: CN112287601A
Application number: CN202011141976.2A
Authority: CN
Inventors: 李伟; 王攀磊; 鲁耀; 张静; 刘浩; 董石飞; 杨应明; 王超; 耿川雄; 陈拾华; 杨景华; 王建新; 聂鑫; 朱海滨; 林昆; 杨义; 段宗颜; 张忠武; 严君; 邹炳礼
Original assignee: Hongyun Honghe Tobacco Group Co Ltd; Institute of Agricultural Environment and Resources of Yunnan Academy of Agricultural Sciences
Current assignee: Hongyun Honghe Tobacco Group Co Ltd; Institute of Agricultural Environment and Resources of Yunnan Academy of Agricultural Sciences
Priority date: 2020-10-23
Filing date: 2020-10-23
Publication date: 2021-01-29
Anticipated expiration: 2040-10-23
Also published as: CN112287601B

Abstract

The invention belongs to the technical field of tobacco leaf quality prediction and discloses a method, a medium and an application for constructing a tobacco leaf quality prediction model using the R language. Data transformation and screening are carried out on the predictor variables; a predictor variable set and a result variable set are created, and the data are split and resampled; several regression methods are selected for modeling; the root mean square error (RMSE) and the coefficient of determination (R²) are used to evaluate the prediction performance of the different models, and the optimal model is selected from the candidate models accordingly. The resulting ecological factor model can predict, from the ecological and climatic conditions of the current year, how the quality of single-grade tobacco leaves fluctuates across different areas in that year, so that the purchase grades, quantities and proportions of tobacco leaves can be adjusted proactively and in a targeted manner, ensuring a stable supply of tobacco leaves of consistent quality.

Description

Method and medium for constructing tobacco leaf quality prediction model by using R language and application

Technical Field

The invention belongs to the technical field of tobacco leaf quality prediction, and particularly relates to a method, a medium and application for constructing a tobacco leaf quality prediction model by using an R language.

Background

Tobacco leaf quality results from the combined action of genetic factors, the ecological environment and cultivation techniques. Numerous studies show that ecological factors such as climate, soil and topography strongly influence the agronomic traits, physical characteristics, chemical composition, disease rate, aroma substance content and smoking quality of tobacco leaves. Because tobacco leaf quality is multifactorial, variable and difficult to quantify, the influence of the ecological environment is especially prominent, and changes in light, temperature, water and air conditions cause tobacco leaf quality to differ considerably between planting areas and between years. Constructing an ecological factor model that predicts changes in tobacco leaf quality from ecological factors such as climate, soil and cultivation management is therefore very important for improving tobacco leaf quality.

The above analysis shows the following problems and defects in the prior art. Existing prediction models mostly predict the sensory quality of tobacco leaves from their internal chemical components. Studies of the relationship between ecological factors and tobacco leaf quality mostly use methods such as principal component regression analysis and grey correlation analysis to examine the influence and contribution of ecological factors, identify the key ecological factors, and guide tobacco production by regulating them. However, no prediction model is available that predicts tobacco leaf quality from the external ecological factors of flue-cured tobacco growth.

The difficulty in solving the above problems and defects is as follows: on the one hand, constructing a prediction model requires a large amount of complete tobacco leaf quality data together with the corresponding ecological factor data; on the other hand, the data types involved in the invention are complex, including both continuous variables and factor-type (categorical) variables, and the prediction model constructed by each regression method carries uncertainty.

The significance of solving the above problems and defects is as follows. The invention constructs the prediction model using the R language, which offers a wide variety of regression methods. R is open-source software for mathematical and statistical computing; it makes many candidate models available, supports the construction of relatively complex prediction models on large data sets, and allows the uncertainty of each model to be explored through rigorous training and testing so that the optimal model can be selected. This reduces the workload and cost of tobacco leaf testing and addresses the raw material supply and blending problems caused by the lag in tobacco leaf quality testing. Using the prediction model, tobacco leaf quality can be evaluated and predicted from the ecological and climatic conditions of the current year, which ensures a stable supply of the grades and quantities of tobacco leaf raw material required by the cigarette formula modules and a stable quality of the cigarette products.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a method, a medium and application for constructing a tobacco leaf quality prediction model by using an R language.

The invention is realized as follows. The tobacco leaf quality prediction method based on the ecological factor model comprises the following steps:

step one, carrying out data transformation and predictor variable screening on the predictor variables used in tobacco leaf quality prediction;

step two, creating a predictor variable set and a result variable set for tobacco leaf quality prediction, and splitting and resampling the data;

step three, selecting a plurality of regression methods to model the data, thereby obtaining different tobacco leaf quality prediction models;

step four, using the root mean square error (RMSE) and the coefficient of determination (R²) to evaluate the prediction performance of the different prediction models, and selecting an optimal model from them according to that performance.

Further, in step one, the data transformation performed on the predictor variables includes centering, standardization and skewness transformation. Centering subtracts the mean from each variable, so that the transformed variable has a mean of 0. Standardization divides each variable by its own standard deviation, which forces the standard deviation of the variable to be 1. The skewness transformation removes distributional skew, turning a right-skewed or left-skewed distribution into an approximately symmetric one.

Further, in step one, the method for performing data transformation on the predictor variable includes:

(I) constructing a transformation object trans with the preProcess function in the caret package, which simultaneously specifies centering (center), standardization (scale) and skewness transformation (BoxCox) of the data;

(II) after the trans object is constructed, transforming the original data with the predict function.

Further, in step one, the method for screening the predictive variable includes:

(1) removing zero variance variables: the near-zero variance variables to be filtered out are detected with the nearZeroVar function in the caret package; if the output shows that the data set contains zero variance variables, these variables are removed;

(2) multiple collinearity variables are removed.

Further, in step (2), the method for removing multiple collinearity variables includes:

1) calculating the correlation coefficient matrix of the predictor variables with the cor function (the corrplot package is loaded for the correlation analysis);

2) finding the pair of predictor variables with the largest absolute correlation coefficient, denoted predictor variables A and B; in caret this search is carried out by the findCorrelation function;

3) calculating the mean correlation coefficient between A and the other predictor variables, and doing the same for B; the columns flagged as highly correlated can be listed with the head function;

4) if the mean correlation coefficient of A is larger, removing A; otherwise, removing B;

5) repeating steps 2) to 4) until the absolute values of all correlation coefficients are below the set threshold.

Further, in step two, the method for creating a set of predictor variables and a set of result variables includes:

(I) establishing the predictor variable set predictors from predictor variable columns 1 to n of the data set;

(II) establishing the result variable set result from the result variable column, column n + 1 of the data set.

Further, in step two, the method for performing segmentation processing on data includes:

(1) randomly selecting samples with the createDataPartition function in the caret package to construct the training set;

(2) after the training rows are obtained, creating a predictor variable training set TrainPredictors and a result variable training set TrainResult from those rows;

(3) meanwhile, creating a predictor variable test set TestPredictors and a result variable test set TestResult from the remaining samples.

Further, in step two, the method for resampling the data includes: K-fold cross-validation resampling, implemented with the trainControl function in the caret package.

Further, the K-fold cross-validation method comprises:

1) randomly dividing the samples into k subsets of roughly equal size, and first fitting the model on all samples except the first subset;

2) predicting the held-out first subset (the first fold) with the model, and evaluating the model on these predictions;

3) returning the first subset to the training set, holding out the second subset for model evaluation, and so on;

4) calculating the mean and standard deviation of the k model evaluation results, and on that basis determining the relationship between the tuning parameters and the model performance.

Further, the candidate models tested in step four comprise linear regression models, nonlinear regression models and regression tree models; the linear regression models comprise a generalized linear model, a stepwise linear regression model and a partial least squares regression model; the nonlinear regression models comprise a support vector machine (SVM) model and a K-nearest neighbor model; the regression tree models comprise simple regression trees, regression model trees, random forests and Cubist models.

Further, in step four, the models are fitted and evaluated with the train function in the caret package; the prediction performance of the models is compared with the resamples function in caret, and the results are viewed with summary(resample).

Further, in the model comparison results, the model can be chosen according to the RMSE and R² of each model: the smaller the RMSE, the higher the prediction accuracy of the model; the larger the R², the better the model fits the data.

Another object of the present invention is to provide a computer-readable storage medium storing instructions which, when executed on a computer, cause the computer to perform the method for predicting tobacco quality based on an ecological factor model.

Another object of the present invention is to provide a computer terminal, comprising:

a transformation and screening module, used for carrying out data transformation and predictor variable screening on the predictor variables used in tobacco leaf quality prediction;

a splitting and resampling module, used for creating a predictor variable set and a result variable set for tobacco leaf quality prediction, and for splitting and resampling the data;

a prediction model acquisition module, used for selecting a plurality of regression methods to model the data, thereby obtaining different tobacco leaf quality prediction models;

an optimal model screening module, used for evaluating the prediction performance of the different prediction models with the root mean square error (RMSE) and the coefficient of determination (R²), and for selecting an optimal model from them according to that performance.

The invention also aims to provide application of the tobacco quality prediction method based on the ecological factor model in the quality detection of the tobacco in agronomic characters, physical characteristics, chemical components, disease rate, aroma substance content, smoking quality, different planting areas and different years.

Taken together, the above technical schemes give the invention the following advantages and positive effects. The tobacco leaf quality prediction method based on the ecological factor model constructs, using the R language, an optimal ecological factor model for predicting tobacco leaf quality. R is open-source software for mathematical and statistical computing that supports the construction of relatively complex prediction models on large data sets. Because every prediction model carries uncertainty, many candidate models are fitted in R, their uncertainty is explored through rigorous training and testing, and the optimal model is selected. The model provided by the invention can predict, from the ecological and climatic conditions of the current year, how the quality of single-grade tobacco leaves fluctuates across different areas in that year, so that the purchase grades, quantities and proportions of tobacco leaves can be adjusted proactively and in a targeted manner, ensuring a stable supply of tobacco leaves of consistent quality.

Drawings

FIG. 1 is a flow chart of a tobacco leaf quality prediction method based on an ecological factor model according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Aiming at the problems in the prior art, the invention provides a tobacco leaf quality prediction method based on an ecological factor model, and the invention is described in detail below with reference to the accompanying drawings.

The tobacco leaf quality prediction method based on the ecological factor model provided by the embodiment of the invention comprises the following steps: carrying out data transformation and predictor variable screening on the predictor variables used in tobacco leaf quality prediction;

creating a predictor variable set and a result variable set for tobacco leaf quality prediction, and splitting and resampling the data;

selecting a plurality of regression methods to model the data, thereby obtaining different tobacco leaf quality prediction models; and

using the root mean square error (RMSE) and the coefficient of determination (R²) to evaluate the prediction performance of the different prediction models, and selecting an optimal model from them according to that performance.

Specifically, as shown in fig. 1, the tobacco leaf quality prediction method based on the ecological factor model provided by the embodiment of the present invention includes the following steps:

S101, data preprocessing: carrying out data transformation and predictor variable screening on the predictor variables.

S102, data partitioning: creating a predictor variable set and a result variable set, and splitting and resampling the data.

S103, data modeling: selecting a plurality of regression methods to model the data.

S104, model optimization: using the root mean square error (RMSE) and the coefficient of determination (R²) to evaluate the prediction performance of the different models, and selecting the optimal model from the candidate models according to that performance.

The present invention will be further described with reference to the following examples.

Example 1

1. Data pre-processing

Data preprocessing generally refers to adding to, deleting from or transforming the training set data. Transforming the data reduces the influence of skewness and outliers and can markedly improve model performance.

1.1 Data transformation

Many prediction models require the predictor variables to be on the same scale, so the variables need to be transformed by centering, standardization and skewness transformation. Centering subtracts the mean from each variable, so that the transformed variable has a mean of 0. Standardization divides each variable by its own standard deviation, which forces the standard deviation of the variable to be 1. The skewness transformation removes distributional skew, turning a right-skewed or left-skewed distribution into an approximately symmetric one.

The method uses the preProcess function in the caret package to construct a transformation object trans, which specifies centering (center), standardization (scale) and skewness transformation (BoxCox) of the data in a single call. The trans object is constructed as follows:

trans<-preProcess(data,

method=c("BoxCox","center","scale"))

After the trans object is constructed, the predict function is used to transform the original data, where data is the original data and transformed.data is the transformed data.

transformed.data<-predict(trans,data)
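As a rough illustration of what the three transformations do, the following base R sketch applies a Box-Cox transform with a fixed λ and then centers and scales the result. The helper name boxcox_fixed, the toy vector x and the choice λ = 0 are assumptions made for illustration only; preProcess estimates λ from the data rather than fixing it.

```r
# Box-Cox transform for a fixed lambda (x must be strictly positive).
# lambda = 0 corresponds to the natural logarithm.
boxcox_fixed <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Toy right-skewed data (made up for illustration).
x <- c(1, 2, 2, 3, 4, 8, 20, 55)

x_bc       <- boxcox_fixed(x, lambda = 0)  # reduce the right skew
x_centered <- x_bc - mean(x_bc)            # centering: the mean becomes 0
x_scaled   <- x_centered / sd(x_bc)        # scaling: the standard deviation becomes 1

mean(x_scaled)  # 0 up to floating-point error
sd(x_scaled)    # 1
```

The order matters in this sketch: the skewness transformation is applied first, and centering and scaling are then computed on the transformed values.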

1.2 Predictor variable screening

Some predictor variables need to be removed before modeling to improve model performance and stability. Predicting with fewer variables reduces computational complexity, and removing redundant predictor variables makes it easier to obtain a simpler, more interpretable model.

1.2.1 removing zero variance variables

A zero variance variable is a predictor variable that takes only one value. Such a variable contributes almost nothing to the model and therefore needs to be identified and removed. A predictor is treated as a near-zero variance variable if the ratio of the number of distinct values to the sample size is low (for example, below 10%) and the ratio of the frequency of the most common value to that of the second most common value is high.

The method uses the nearZeroVar function in the caret package to detect the near-zero variance variables to be filtered out:

nearZeroVar(data)

If the output shows that the data set contains zero variance variables, these variables need to be removed.
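The two criteria described above can be sketched in base R as follows. The function name near_zero_var_sketch and the toy data frame are hypothetical; the cutoffs (a frequency ratio above 95/5 and at most 10% distinct values) are assumed to mirror the defaults documented for nearZeroVar, and the caret function may differ in details such as its handling of missing values.

```r
# Flag columns that are constant or nearly constant.
# freq_cut and unique_cut correspond to the two criteria described above.
near_zero_var_sketch <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  flagged <- vapply(df, function(col) {
    counts <- sort(table(col), decreasing = TRUE)
    if (length(counts) == 1) return(TRUE)             # zero variance: a single value
    freq_ratio <- counts[[1]] / counts[[2]]           # most common vs. second most common
    pct_unique <- 100 * length(counts) / length(col)  # distinct values as % of samples
    freq_ratio > freq_cut && pct_unique <= unique_cut
  }, logical(1))
  which(flagged)
}

# Toy data (made up): 'constant' has one value, 'rare' is almost constant.
toy <- data.frame(
  constant = rep(1, 100),
  rare     = c(rep(0, 98), 1, 1),
  varied   = seq_len(100)
)
near_zero_var_sketch(toy)  # column indices of 'constant' and 'rare'
```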

1.2.2 Removing multicollinear variables

Collinearity refers to a strong correlation between a pair of predictor variables; collinearity among several predictor variables is called multicollinearity. Redundant predictor variables generally add complexity to the model rather than information, and using highly correlated predictor variables in a linear regression model can make the model very unstable, so highly correlated variables should be avoided among the predictors. The specific algorithm is as follows:

1. calculate the correlation coefficient matrix of the predictor variables;

2. find the pair of predictor variables with the largest absolute correlation coefficient (denoted predictor variables A and B);

3. calculate the mean correlation coefficient between A and the other predictor variables, and do the same for B;

4. if the mean correlation coefficient of A is larger, remove A; otherwise, remove B;

5. repeat steps 2 to 4 until the absolute values of all correlation coefficients are below the set threshold.

In the following commands, data is a data set, and correlations is a correlation coefficient matrix between every two predictive variables in the data set.

correlations<-cor(data)

After the correlation coefficients are calculated, the findCorrelation function is used to find the predictor variables with high correlation coefficients. Here correlations is the correlation coefficient matrix, highcorr.correlations holds the column indices of the predictor variables flagged for a correlation coefficient above 0.75, and cutoff is the threshold set for screening the correlation coefficients:

highcorr.correlations<-findCorrelation(correlations,cutoff=0.75)

The head function lists the variable columns with high correlation coefficients:

head(highcorr.correlations)

The variable columns with high correlation coefficients are then removed. In the following command, data.filtered is the data set after the multicollinear variables have been removed:

data.filtered<-data[,-highcorr.correlations]
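The five-step algorithm in this section can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the implementation of findCorrelation: the function name drop_correlated and the toy data are hypothetical, and the caret function may compute the mean correlations or break ties differently.

```r
# Greedy removal of highly correlated predictors (steps 1 to 5 above).
# Returns the names of the columns to drop.
drop_correlated <- function(df, cutoff = 0.75) {
  corr <- abs(cor(df))              # step 1: absolute correlation matrix
  diag(corr) <- 0                   # ignore self-correlation
  dropped <- character(0)
  while (ncol(corr) > 1 && max(corr) > cutoff) {
    idx <- which(corr == max(corr), arr.ind = TRUE)[1, ]  # step 2: worst pair (A, B)
    a <- idx[[1]]
    b <- idx[[2]]
    mean_a <- mean(corr[a, -a])     # step 3: mean correlation of A with the others
    mean_b <- mean(corr[b, -b])     #         and the same for B
    remove <- if (mean_a > mean_b) a else b               # step 4
    dropped <- c(dropped, colnames(corr)[remove])
    corr <- corr[-remove, -remove, drop = FALSE]          # step 5: repeat on the rest
  }
  dropped
}

set.seed(1)
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.1),   # nearly a copy of x1
  x3 = rnorm(200)                   # unrelated to the others
)
drop_correlated(toy)                # drops one of the near-duplicate pair
```

Returning column names rather than indices keeps the result valid after columns have been removed, at the cost of requiring named columns.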

2. Data partitioning

2.1 Creating the predictor variable set and the result variable set

When a prediction model is constructed, the data comprise several predictor variables and one result variable, and separate data sets need to be established for the predictor variables and for the result variable.

The following command establishes the predictor variable set predictors from predictor variable columns 1 to n of the data set data:

predictors<-data[,1:n]

The following command establishes the result variable set result from the result variable column, column n + 1 of the data set data:

result<-data[,n+1]

The predictor variable set predictors and the result variable set result are thus established separately.

2.2 Data splitting

While learning the general patterns in the data, some models also learn noise that is specific to individual samples; this is called overfitting. An overfitted model often fails to predict new samples accurately. Inappropriate tuning parameters can cause overfitting, so the model parameters need to be tuned on the data to give the best predictions. Data that are used to evaluate the model must therefore not be used to build or tune it; only then does the evaluation give an unbiased estimate of model performance. When the prediction model is established, part of the samples is selected to construct the model and the rest is reserved for model evaluation. The set of samples used for modeling is called the "training set", and the set of samples used to verify model performance is called the "test set".

A training set can be constructed by randomly selecting samples with the createDataPartition function in the caret package. In the following command, result is the result variable on which the split is based, trainningrows holds the randomly drawn row indices assigned to the training set, and p=0.8 means that 80% of the samples are drawn for the training set:

trainningrows<-createDataPartition(result,

p=0.8,

list=FALSE)

After the training rows are obtained, a predictor variable training set TrainPredictors and a result variable training set TrainResult are created from those rows:

TrainPredictors<-predictors[trainningrows,]

TrainResult<-result[trainningrows]

Meanwhile, a predictor variable test set TestPredictors and a result variable test set TestResult are created from the remaining samples:

TestPredictors<-predictors[-trainningrows,]

TestResult<-result[-trainningrows]
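The split can be illustrated without caret by a plain random 80/20 split in base R. The toy data frame and the name train_rows are assumptions made for illustration; unlike this sketch, createDataPartition stratifies the sample on the result variable, so its split is not a purely random draw.

```r
set.seed(42)

# Toy data set (made up): columns 1..n are predictors, column n + 1 is the result.
n <- 3
data <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50), score = rnorm(50))
predictors <- data[, 1:n]
result <- data[, n + 1]

# Simple random 80/20 split by row index.
train_rows <- sample(nrow(data), size = floor(0.8 * nrow(data)))

TrainPredictors <- predictors[train_rows, ]
TrainResult     <- result[train_rows]
TestPredictors  <- predictors[-train_rows, ]
TestResult      <- result[-train_rows]

nrow(TrainPredictors)  # 40
nrow(TestPredictors)   # 10
```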

2.3 Resampling

Resampling fits a model on a subset of the training set and evaluates it on the remaining training samples; this process is repeated many times and the results are then summarized. Resampling gives a reasonable estimate of how the model will perform on future samples. Several resampling methods are available.

The method uses K-fold cross-validation. The samples are randomly divided into k subsets of roughly equal size. A model is fitted on all samples except the first subset (the first fold), the held-out first fold is then predicted with this model, and the model is evaluated on these predictions. The first subset is then returned to the training set, the second subset is held out for model evaluation, and so on. The k resulting performance estimates are summarized (typically by their mean and standard deviation), and on that basis the relationship between the tuning parameters and the model performance is determined.

K-fold cross-validation resampling is set up with the trainControl function in the caret package. In the following command, trainControl configures the resampling, where method="cv" indicates that K-fold cross-validation is used and number=10 indicates 10 folds.

trainControl(method="cv",number=10)
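The K-fold procedure described in this section can be written out by hand in base R. The sketch below uses a toy data set and an ordinary linear model, both chosen only for illustration; in the method itself the folds are managed by caret.

```r
set.seed(7)

# Toy regression data (made up for illustration).
toy <- data.frame(x = runif(100))
toy$y <- 2 * toy$x + rnorm(100, sd = 0.2)

k <- 10
# Randomly assign each row to one of k folds of equal size.
fold_id <- sample(rep(seq_len(k), length.out = nrow(toy)))

fold_rmse <- vapply(seq_len(k), function(i) {
  held_out <- toy[fold_id == i, ]
  fit  <- lm(y ~ x, data = toy[fold_id != i, ])  # fit on the other k - 1 folds
  pred <- predict(fit, newdata = held_out)       # predict the held-out fold
  sqrt(mean((held_out$y - pred)^2))              # RMSE for this fold
}, numeric(1))

mean(fold_rmse)  # cross-validated RMSE estimate
sd(fold_rmse)    # variability across the k folds
```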

3. Data modeling

The method models the data with several regression methods and selects the optimal model from the candidate models according to their performance. The candidates are linear regression models, nonlinear regression models and regression tree models. The linear regression models comprise a generalized linear model, a stepwise linear regression model and a partial least squares regression model; the nonlinear regression models comprise a support vector machine (SVM) model and a K-nearest neighbor model; the regression tree models comprise simple regression trees, regression model trees, random forests and Cubist models.

All of the above models are fitted and evaluated with the train function in the caret package. The general command is as follows, where fit is the fitted model, the method argument names the regression method (the placeholder "x" is replaced by the method string of each model, which differs from model to model), and trControl specifies the resampling method, here 10-fold cross-validation.

fit<-train(x=TrainPredictors,y=TrainResult,

method="x",

trControl=trainControl(method="cv",number=10))

4. Model optimization

The root mean square error (RMSE) and the coefficient of determination (R²) are used to evaluate the prediction performance of the different models. RMSE is a function of the model residuals, that is, the observed values minus the values predicted by the model, and can be read as the average distance between the observed and predicted values. The coefficient of determination (R²) is interpreted as the proportion of the information in the data that is explained by the model.
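For reference, the two measures can be written as follows for n samples with observed values y_i, predicted values ŷ_i and observed mean ȳ. These are the conventional textbook definitions and are stated here as an assumption; caret's resampling summaries by default report R² as the squared correlation between observed and predicted values, which can differ from the second formula.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```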

The prediction performance of the models is compared with the resamples function in caret. In the following commands, resample collects the resampling results of the models, summary(resample) displays them, and fit1, fit2 and fit3 represent different models.

resample<-resamples(list(fit1,fit2,fit3))

summary(resample)

In the model comparison results, the model can be chosen according to the RMSE and R² of each model: the smaller the RMSE, the higher the prediction accuracy of the model; the larger the R², the better the model fits the data.
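The selection rule can be illustrated in base R by comparing two candidate models on their cross-validated RMSE. The toy data, the two formulas and the helper cv_rmse are assumptions made for illustration; they stand in for the caret models and the resamples comparison used by the method.

```r
set.seed(11)

# Toy data with a curved relationship (made up for illustration).
toy <- data.frame(x = runif(120, -2, 2))
toy$y <- toy$x^2 + rnorm(120, sd = 0.3)

# Two candidate models, written as formulas.
candidates <- list(linear = y ~ x, quadratic = y ~ x + I(x^2))

k <- 10
fold_id <- sample(rep(seq_len(k), length.out = nrow(toy)))

# Mean cross-validated RMSE for one candidate formula.
cv_rmse <- function(form) {
  mean(vapply(seq_len(k), function(i) {
    fit <- lm(form, data = toy[fold_id != i, ])
    held_out <- toy[fold_id == i, ]
    sqrt(mean((held_out$y - predict(fit, newdata = held_out))^2))
  }, numeric(1)))
}

scores <- vapply(candidates, cv_rmse, numeric(1))
best <- names(scores)[which.min(scores)]  # the smaller RMSE is preferred
scores
best
```

Both candidates are evaluated on the same folds, so the difference in RMSE reflects the models rather than the random split.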

5. Model validation

(1) Model prediction

Prediction models are constructed from the training set above, and the better-performing models are selected on the basis of RMSE and R². This part tests the prediction performance of each selected model using the predict function and the test set data. In the following command, predict is the prediction function, fit is the model under test, and TestPredictors holds the predictor variables of the test set.

PredictedResult<-predict(fit,TestPredictors)

(2) Model validation

The predicted values PredictedResult obtained from the model are compared with the observed values TestResult of the test set to measure the prediction performance of the model. Model quality is assessed with the following two visualization methods and one numerical check.

1) Scatter plot of observed versus predicted values, used to assess the model fit. The plot function is used to draw the scatter plot. For an ideal model the points lie along a line with a slope of 1; the closer the points are to this line, the better the model predicts.

plot(TestResult,PredictedResult)

2) Scatter plot of residuals versus predicted values, used to reveal systematic patterns in the predictions

The difference between the observed value and the predicted value is the model residual, which is calculated with the following command:

residualvalues<-TestResult-PredictedResult

For a model with no systematic error, the residuals should be scattered evenly around 0. The plot function can be used to draw the scatter plot of residuals against predicted values.

plot(PredictedResult,residualvalues)

3) Calculating the RMSE and R² between observed and predicted values

The R2 and RMSE functions are used to quantify the agreement between the observed and predicted values. The commands are as follows:

R2(PredictedResult,TestResult)

RMSE(PredictedResult,TestResult)

As before, a larger R² indicates a better fit between the observed and predicted values, and a smaller RMSE indicates that the predicted values are closer to the observed values; both indicate better prediction performance.
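The quantities used in this validation step can also be computed by hand in base R, which makes the definitions explicit. The observed and predicted values below are made up for illustration. Two versions of R² are shown because, as far as the caret documentation indicates, its R2 function defaults to the squared correlation, while the traditional definition is 1 minus the ratio of residual to total sum of squares; the two need not agree on test data.

```r
# Toy observed and predicted values (made up for illustration).
observed  <- c(3.1, 2.8, 3.6, 4.0, 3.3)
predicted <- c(3.0, 3.0, 3.5, 3.8, 3.4)

residuals_toy <- observed - predicted       # observed minus predicted

rmse    <- sqrt(mean(residuals_toy^2))      # root mean square error
r2_corr <- cor(observed, predicted)^2       # squared correlation
r2_trad <- 1 - sum(residuals_toy^2) /
  sum((observed - mean(observed))^2)        # 1 - SS_res / SS_tot

rmse
r2_corr
r2_trad
```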

Example 2

1. Data pre-processing

First, the data required by the model are preprocessed. Transforming the data reduces the influence of skewness and outliers and can markedly improve model performance.

1.1 Data transformation

1.1.1 Importing data

library(readxl)

# load the data-reading package "readxl" (an R package is a collection of R functions, code and sample data)

tobacco<-read_excel("tobacco.xlsx",col_names=TRUE)

# import the data and name it "tobacco"

1.1.2 Data structure and transformation

(1) Viewing the data structure

str(tobacco)

The example data comprise 595 samples and 51 variables: 50 predictor variables and 1 result variable. The predictor variables comprise 6 character-type variables and 44 continuous variables.

Character-type predictor variables need to be converted into factors. In this example the 6 character-type variables Area, Cultivar, Position, soiltype, landform and transplant were converted into factors.

tobacco$Area<-factor(tobacco$Area)

tobacco$Cultivar<-factor(tobacco$Cultivar)

tobacco$Position<-factor(tobacco$Position)

tobacco$soiltype<-factor(tobacco$soiltype)

tobacco$landform<-factor(tobacco$landform)

tobacco$transplant<-factor(tobacco$transplant,levels=c("early","middle","late"),ordered=TRUE)
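The difference between the unordered and the ordered conversion can be seen on a small example. The toy data frame below is made up for illustration and only borrows two column names from the example data.

```r
# Toy data (made up): character columns as they might arrive from a spreadsheet.
toy <- data.frame(
  Area       = c("A", "B", "A", "C"),
  transplant = c("late", "early", "middle", "early"),
  stringsAsFactors = FALSE
)

toy$Area <- factor(toy$Area)                        # unordered factor
toy$transplant <- factor(toy$transplant,
                         levels = c("early", "middle", "late"),
                         ordered = TRUE)            # ordered: early < middle < late

levels(toy$Area)                       # "A" "B" "C"
is.ordered(toy$transplant)             # TRUE
toy$transplant[1] > toy$transplant[2]  # TRUE: "late" ranks above "early"
```

Declaring the levels explicitly matters here because the default alphabetical order would place "late" between "early" and "middle".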

The continuous variables were centered, standardized and skewness-transformed using the preProcess function of the caret package. In this example the 44 continuous variables were: TN (total nitrogen), Ni (nicotine), TS (total sugar), RS (reducing sugar), K (potassium), Cl (chlorine), PE (petroleum ether extract), St (starch), N/Ni (nitrogen-to-nicotine ratio), RS/Ni (sugar-to-nicotine ratio), DS (sugar-to-sugar ratio), K/Cl (potassium-to-chlorine ratio), soil particle size, elevation, ph, som (soil organic matter), an (soil available nitrogen), ap (soil available phosphorus), ak (soil available potassium), scl (soil chlorine), B (soil boron), growth period, leaf number, fertilization (nitrogen application amount), rainfall in May, June, July and August and over the growth period, temperature (maytem, junetem, julytem, augusttem and the growth-period temperature), sunshine (maysun, junesun, julysun, augustsun and the growth-period sunshine) and humidity (May, June, July, August and growth-period humidity).

library(caret)

# load caret Package

tobacco.numeric<-as.data.frame(tobacco[,c(5:34,38:69)])

# select the numeric columns and create a data set

trans<-preProcess(tobacco.numeric,

method=c("BoxCox","center","scale"))

# preProcess combines the skewness (Box-Cox) transformation, centering and scaling into a single transformation object, trans.

tobacco.transformed.numeric<-predict(trans,tobacco.numeric)

# apply the trans object to transform the continuous variables.

tobacco.factor<-tobacco[,c(1:4,35:37)]

tobacco.transformed<-cbind(tobacco.factor,tobacco.transformed.numeric)

# combine the factor predictors and the continuous predictors.
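What the three transformations do to a single variable can be sketched in base R. This is an illustration under stated assumptions, not the caret implementation: caret estimates the Box-Cox parameter from the data, whereas the helper `box_cox` below (a hypothetical name) takes a fixed lambda, here 0, which corresponds to a log transform. The input `x` is synthetic.

```r
set.seed(1)
x <- rlnorm(200, meanlog = 0, sdlog = 1)  # synthetic, right-skewed, strictly positive

# Box-Cox transform for a given lambda (assumption: lambda is supplied, not estimated).
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Sample skewness, used only to show the effect of the transform.
skewness <- function(v) mean((v - mean(v))^3) / (mean((v - mean(v))^2))^1.5

x_bc       <- box_cox(x, lambda = 0)   # skewness transformation
x_centered <- x_bc - mean(x_bc)        # centering: mean becomes 0
x_scaled   <- x_centered / sd(x_bc)    # scaling: standard deviation becomes 1

c(skew_before = skewness(x), skew_after = skewness(x_scaled))
```

Box-Cox requires strictly positive values, which is why the transform is applied before centering.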

1.2 Predictor variable screening

1.2.1 Removing near-zero-variance variables

Near-zero-variance variables to be filtered out are detected with the nearZeroVar function in the caret package.

nearZeroVar(tobacco.transformed.numeric)
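The idea behind the near-zero-variance check can be sketched in base R. The helper `near_zero_var` is a simplified stand-in, not caret's code: it flags a column when the most common value is far more frequent than the second most common one and the share of distinct values is small. The default thresholds (95/5 and 10%) mirror caret's documented defaults; the data frame is synthetic.

```r
# Simplified sketch of a near-zero-variance filter (hypothetical helper).
near_zero_var <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  flagged <- vapply(df, function(col) {
    counts <- sort(as.vector(table(col)), decreasing = TRUE)
    # Ratio of the most common value's count to the runner-up's count;
    # a constant column has no runner-up, so it is treated as infinite.
    freq_ratio <- if (length(counts) > 1) counts[1] / counts[2] else Inf
    pct_unique <- 100 * length(counts) / length(col)
    freq_ratio > freq_cut && pct_unique < unique_cut
  }, logical(1))
  which(flagged)
}

set.seed(3)
demo <- data.frame(
  rare_flag = c(rep(0, 99), 1),  # almost constant
  measure   = rnorm(100),        # informative
  constant  = rep(5, 100)        # zero variance
)
flagged <- near_zero_var(demo)
names(flagged)
```

Columns flagged this way carry almost no information and can destabilize models when a resample happens to contain only one value.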

1.2.2 Removing multicollinear variables

Removing highly multicollinear variables among the chemical components

library(corrplot)

# load the package used for correlation analysis

tobacco.chemical<-tobacco.transformed[,26:37] # extract the tobacco leaf chemical component predictor variables

correlations.chemical<-cor(tobacco.chemical) # calculate the correlation coefficients

highcorr.chemical<-findCorrelation(correlations.chemical,cutoff=0.75) # find variables with correlation coefficients above 0.75

head(highcorr.chemical) # list the columns with high correlation coefficients; in this example, RS/Ni, TS and Cl are multicollinear variables

## [1] 10 3 6

tobacco.chemical.filtered<-tobacco.chemical[,-highcorr.chemical] # remove the multicollinear variables
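The logic of a correlation-based filter can be sketched in base R. The helper `find_high_corr` is a simplified greedy variant written for illustration and is not guaranteed to return the same columns as caret's findCorrelation: it repeatedly takes the most strongly correlated remaining pair and drops the member with the larger mean absolute correlation. The data are synthetic.

```r
# Simplified greedy correlation filter (hypothetical helper).
find_high_corr <- function(cor_mat, cutoff = 0.75) {
  abs_cor <- abs(cor_mat)
  diag(abs_cor) <- 0
  drop <- integer(0)
  repeat {
    keep <- setdiff(seq_len(ncol(abs_cor)), drop)
    if (length(keep) < 2) break
    sub <- abs_cor[keep, keep, drop = FALSE]
    if (max(sub) <= cutoff) break
    # Most strongly correlated remaining pair.
    pair <- which(sub == max(sub), arr.ind = TRUE)[1, ]
    # Of the two, drop the one with the larger mean absolute correlation.
    mean_cor <- rowMeans(sub)
    worst <- pair[which.max(mean_cor[pair])]
    drop <- c(drop, keep[worst])
  }
  drop
}

set.seed(2)
base_signal <- rnorm(100)
X <- cbind(
  v1 = base_signal,
  v2 = base_signal + rnorm(100, sd = 0.1),  # nearly a copy of v1
  v3 = rnorm(100),
  v4 = rnorm(100)
)
dropped <- find_high_corr(cor(X), cutoff = 0.75)
# Guard against an empty result: X[, -integer(0)] would drop every column.
X_filtered <- if (length(dropped) > 0) X[, -dropped, drop = FALSE] else X
colnames(X_filtered)
```

The guard on the last step is worth keeping in real scripts, because negative indexing with an empty vector removes all columns rather than none.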

Removing highly multicollinear variables among the ecological factors

# extract the ecological factor predictor variables into tobacco.ecological

correlations.ecological<-cor(tobacco.ecological) # calculate the correlation coefficients

highcorr.ecological<-findCorrelation(correlations.ecological,cutoff=0.75)

# find variables with correlation coefficients above 0.75

head(highcorr.ecological) # list the columns with high correlation coefficients

## [1] 32 28 17 29 15 30

In this example, May humidity (mayhumidity), June humidity (junehumidity), July humidity (julyhumidity), growth-period humidity (growthhumidity), July rainfall (julyrainfall), and growth-period rainfall (growthrainfall) are multicollinear variables.

tobacco.ecological.filtered<-tobacco.ecological[,-highcorr.ecological] # remove the multicollinear variables

The transformed and screened variables are combined to form the predictor variable data set.

tobacco.filtered<-cbind(tobacco.factor,tobacco.chemical.filtered,tobacco.ecological.filtered)

The data set is exported for later use.

write.csv(tobacco.filtered,"tobacco.filtered.csv",row.names=FALSE)

2. Data set construction

2.1 creating sets of predictor variables and result variables

Importing preprocessed data

tobacco.filtered<-read.csv("tobacco.filtered.csv")

tobacco<-readxl::read_excel("tobacco.xlsx",col_names=TRUE)

Creating the predictor variable sets

(1) Creation of prediction variable sets for conventional prediction models, such as linear regression models

predictors<-tobacco.filtered[,-c(4,8:25)]

(2) Creation of a predictor variable set suitable for models such as the support vector machine and K-nearest neighbors

ind.Area<-nnet::class.ind(predictors$Area)

ind.Cultivar<-nnet::class.ind(predictors$Cultivar)

ind.Position<-nnet::class.ind(predictors$Position)

ind.soiltype<-nnet::class.ind(predictors$soiltype)

ind.landform<-nnet::class.ind(predictors$landform)

ind.transplant<-nnet::class.ind(predictors$transplant)

ind<-cbind(ind.Area,ind.Cultivar,ind.Position,ind.soiltype,ind.landform,ind.transplant)

trans.1<-preProcess(ind,method=c("BoxCox","center","scale"))

ind.transformed<-predict(trans.1,ind)

predictors.ind<-cbind(ind.transformed,predictors[,-c(1:6)])
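The indicator (dummy) coding that nnet::class.ind produces can be reproduced in base R, which makes the structure of the resulting matrix easier to see. The helper `class_indicator` and the soil values are hypothetical and serve only as illustration.

```r
# Build a 0/1 indicator matrix with one column per factor level (hypothetical helper).
class_indicator <- function(f) {
  f <- as.factor(f)
  ind <- outer(as.character(f), levels(f), "==") * 1
  colnames(ind) <- levels(f)
  ind
}

soil <- c("red", "yellow", "red", "purple")  # made-up soil types
ind_soil <- class_indicator(soil)
ind_soil
```

Each row contains exactly one 1, marking the level of that sample. Distance-based and kernel-based models need this numeric coding because they cannot use factor columns directly.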

Creating the result variable set. The result variable is the sensory quality evaluation score of the tobacco leaves; in this example it is the total sensory quality evaluation score.

score<-tobacco$SCORE

2.2 Data partitioning

2.2.1 Data partitioning

Training sets and test sets of predictor variables and outcome variables are created, respectively.

set.seed(222) # set the random seed so that the results are reproducible

trainningrows<-createDataPartition(score,

p=0.8,

list=FALSE)

In this example, 80% of the sample rows are randomly selected as training rows; these rows identify the samples assigned to the training set.

TrainPredictors<-predictors[trainningrows,]

TrainPredictors.ind<-predictors.ind[trainningrows,] # select predictor variable samples for the training set

TrainScore<-score[trainningrows] # select result variable samples for the training set

TestPredictors<-predictors[-trainningrows,]

TestPredictors.ind<-predictors.ind[-trainningrows,] # select predictor variable samples for the test set

TestScore<-score[-trainningrows] # select result variable samples for the test set
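For a numeric outcome, a stratified split of this kind can be approximated in base R by binning the outcome into quantile groups and sampling within each bin. The helper `stratified_partition` is a sketch of that idea and not caret's createDataPartition; the scores are synthetic.

```r
# Stratified random split for a numeric outcome (hypothetical helper).
stratified_partition <- function(y, p = 0.8, groups = 5) {
  # Bin the outcome into quantile-based groups.
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)
  # Sample a proportion p of the row indices within every bin.
  idx <- unlist(lapply(split(seq_along(y), bins), function(rows) {
    rows[sample.int(length(rows), size = ceiling(p * length(rows)))]
  }))
  sort(unname(idx))
}

set.seed(222)
score_demo <- rnorm(100, mean = 75, sd = 3)  # synthetic scores
train_rows <- stratified_partition(score_demo, p = 0.8)

train_score <- score_demo[train_rows]
test_score  <- score_demo[-train_rows]
c(train = length(train_score), test = length(test_score))
```

Sampling within bins keeps the distribution of the outcome similar in the training and test sets, which a plain random split does not guarantee for small samples.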

2.2.2 Resampling

In this example, 10-fold cross-validation is used for resampling. The corresponding argument in the train function is:

trControl=trainControl(method="cv",number=10)
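What 10-fold cross-validation computes can be sketched in base R with an ordinary linear model. The helper `k_fold_rmse`, the data frame and the formula are assumptions made for illustration; caret handles the fold assignment and the aggregation internally.

```r
# K-fold cross-validated RMSE for a linear model (hypothetical helper).
k_fold_rmse <- function(data, formula, k = 10) {
  n <- nrow(data)
  fold_id <- sample(rep(seq_len(k), length.out = n))  # random fold labels
  response <- all.vars(formula)[1]
  rmse <- numeric(k)
  for (i in seq_len(k)) {
    held_out <- fold_id == i
    fit  <- lm(formula, data = data[!held_out, ])       # fit on k-1 folds
    pred <- predict(fit, newdata = data[held_out, ])    # predict the held-out fold
    rmse[i] <- sqrt(mean((data[[response]][held_out] - pred)^2))
  }
  rmse
}

set.seed(222)
demo <- data.frame(x = rnorm(100))
demo$y <- 2 * demo$x + rnorm(100)   # synthetic data with noise sd 1

fold_rmse <- k_fold_rmse(demo, y ~ x, k = 10)
mean(fold_rmse)   # resampled estimate of the prediction error
```

Every sample is held out exactly once, so the averaged RMSE estimates how the model performs on data it has not seen.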

3. Data modeling

3.1 Linear regression model

3.1.1 Generalized linear model

Input code:

set.seed(222)

glm1<-train(x=TrainPredictors,y=TrainScore,

method="glm",

trControl=trainControl(method="cv",number=10))

glm1

Output:

## Generalized Linear Model

## 477 samples # the model was fitted on 477 samples

## 30 predictor # the model uses 30 predictor variables

## Resampling: Cross-Validated (10 fold) # resampling method: 10-fold cross-validation

## Summary of sample sizes: 429, 430, 430, 430, 429, 429, ...

## Resampling results:

## RMSE Rsquared MAE

## 3.001024 0.03957612 2.327293

3.1.2 Stepwise regression linear model

Input code:

set.seed(222)

glmstep1<-train(x=TrainPredictors,y=TrainScore,

method="glmStepAIC",

trControl=trainControl(method="cv",number=10))

Output:

3.1.3 General linear regression

Input code:

Output:

3.1.4 Partial least squares regression (PLSR)

Input code:

Output:

3.2 Nonlinear regression model

3.2.1 Support vector machine (SVM)

Input code:

Output:

3.2.2 K-nearest neighbors

Input code:

Output:

3.3 Regression tree model

3.3.1 Simple regression tree (single tree)

Input code:

Output:

3.3.2 Regression model tree

Input code:

Output:

3.3.3 Random forest

Input code:

Output:

3.3.4 Cubist

Input code:

Output:

The invention is further described below with reference to specific examples and experimental data.

4. Model fitting effect

The fit of each model was evaluated by comparing MAE, RMSE and R².

Input code:

resamp<-resamples(list(glm=glm1,lm=lm1,plsr=plsr1,SVM=SVM1,knnTune=knnTune,rpartTune=rpartTune1,M5Tune=M5Tune1,cubist=cubist1,randomforest=randomforest1))

# compare the models using the resamples function.

Output:

Note: MAE (mean absolute error) is the average of the absolute errors of the model. RMSE (root mean squared error) is the square root of the mean squared difference between the predicted and observed values; it measures the model residuals, where a residual is the observed value minus the model's predicted value, and can be read as the average distance between the observed and predicted values. The coefficient of determination (R²) is interpreted as the proportion of the information contained in the data that the model explains.
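The three measures described in the note can be written out as short base R functions. The function names and the observed and predicted values are invented for illustration. Two common definitions of R² are shown, because caret's documentation describes its default R² as the squared correlation between observed and predicted values, which can differ from the variance-based form.

```r
# Error measures between predicted and observed values (hypothetical helper names).
mae  <- function(pred, obs) mean(abs(obs - pred))
rmse <- function(pred, obs) sqrt(mean((obs - pred)^2))

# R squared as the squared correlation between predictions and observations.
r2_corr <- function(pred, obs) cor(pred, obs)^2

# R squared as 1 minus residual sum of squares over total sum of squares.
r2_trad <- function(pred, obs) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

obs  <- c(70, 72, 75, 78, 80)   # made-up observed scores
pred <- c(71, 71, 76, 77, 82)   # made-up predicted scores

c(MAE = mae(pred, obs), RMSE = rmse(pred, obs),
  R2_corr = r2_corr(pred, obs), R2_trad = r2_trad(pred, obs))
```

RMSE penalizes large errors more heavily than MAE, so RMSE is never smaller than MAE on the same predictions.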

A model R² above 0.26 is considered good, between 0.13 and 0.26 medium, and between 0.02 and 0.13 poor (Cohen et al., 1988). In the model comparison, the random forest model has the lowest MAE and RMSE values and the highest R² value, close to 0.26, so its prediction effect is the best; the SVM and cubist models come second, and the other models fit poorly.
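The grading thresholds can be applied mechanically with a small base R helper. The function name `grade_r2` is hypothetical, and the treatment of the boundaries is an assumption: a value exactly at 0.13 is graded medium and a value exactly at 0.26 is graded good, since the text does not say which side the boundaries belong to. The two inputs are the R² values reported in this example for the generalized linear model under cross-validation and for the random forest on the test set.

```r
# Grade R squared values by the thresholds quoted above (hypothetical helper).
# Values below 0.02 fall outside the quoted ranges and are returned as NA.
grade_r2 <- function(r2) {
  cut(r2,
      breaks = c(0.02, 0.13, 0.26, Inf),
      labels = c("poor", "medium", "good"),
      right  = FALSE)   # intervals are closed on the left
}

grades <- grade_r2(c(glm = 0.03957612, randomforest = 0.256195))
as.character(grades)
```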

5. Model prediction effect

Taking a random forest as an example, the prediction and evaluation process is as follows:

(1) prediction

Predict the test samples with the predict function, applying the random forest model:

PredictedScore.randomforest<-predict(randomforest1,TestPredictors) # randomforest1 is the random forest model, TestPredictors is the test sample set, and PredictedScore.randomforest holds the predicted values.

(2) Calculate the RMSE and R² of the predicted and observed values

R2(PredictedScore.randomforest,TestScore)

##[1]0.256195

RMSE(PredictedScore.randomforest,TestScore)

##[1]2.330102

Similarly, 10 models were predicted and evaluated, and the results are shown in the following table:

Among the models, the random forest model has the smallest mean absolute error (MAE) and root mean square error (RMSE) between predicted and observed values and the largest coefficient of determination (R²), so its prediction result is the best.

The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented wholly or partially in software, they may take the form of a computer program product that includes one or more computer instructions. When the computer instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, wireless, or microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

The above description covers only specific embodiments of the present invention and is not to be construed as limiting its scope of protection; all modifications, equivalents and improvements made within the spirit and principles of the invention fall within the scope of protection defined by the appended claims.

Claims

1. The tobacco quality prediction method based on the ecological factor model is characterized by comprising the following steps of:

respectively carrying out data transformation and screening treatment on the prediction variables in the tobacco quality prediction;

selecting a plurality of regression methods to model the data; different prediction models of the tobacco leaf quality are obtained.

2. The ecological factor model-based tobacco leaf quality prediction method of claim 1, wherein the data transformation of the prediction variables includes centering, standardization and skewness transformation; centering subtracts the mean from every value of a variable, so that the mean of the transformed variable is 0; standardization divides each variable by its own standard deviation, which forces the standard deviation of the variable to be 1; the skewness transformation removes distribution skewness, so that a right-skewed or left-skewed distribution is transformed into an unskewed one and the variables are distributed approximately symmetrically.

3. The ecological factor model-based tobacco leaf quality prediction method of claim 1, wherein the method of data transformation of the prediction variables comprises:

4. The ecological factor model-based tobacco leaf quality prediction method of claim 1, wherein the predictive variable screening method comprises:

(2) removing multicollinear variables;

in the step (2), the method for removing multicollinear variables comprises the following steps:

5. The ecological factor model-based tobacco leaf quality prediction method of claim 1, wherein the method of creating a set of predictor variables and a set of outcome variables comprises:

(II) establishing a result variable set result for the result variable column of the (n + 1) th column in the data set;

the method for segmenting data comprises the following steps:

(1) randomly selecting training rows from the samples using the createDataPartition function in the caret package;

6. The ecological factor model-based tobacco leaf quality prediction method according to claim 1, wherein the method for resampling the data comprises: resampling by K-fold cross-validation, which can be realized using the trainControl function in the caret package;

the K-fold cross-validation method comprises the following steps:

7. The ecological factor model-based tobacco leaf quality prediction method of claim 1, wherein the test model selects a linear regression model, a non-linear regression model and a regression tree model; the linear regression model comprises a generalized linear model, a stepwise regression linear model and a partial least square regression model; the nonlinear regression model comprises a Support Vector Machine (SVM) model and a K nearest neighbor model; the regression tree model comprises a simple regression tree, a regression model tree, a random forest and a cubist model;

prediction and evaluation of the model using the train function in the caret package; evaluating the predicted effect of each model by using a resamples function in the caret package;

in the model comparison results, the preferred model is selected according to the RMSE and R² of each model: the smaller the RMSE, the higher the prediction accuracy of the model; the larger the R², the better the fit of the model.

8. A computer terminal, characterized in that the computer terminal comprises:

a prediction model acquisition module for selecting multiple regression methods to model the data and obtaining different prediction models of tobacco leaf quality;

an optimal model screening module for using the root mean square error RMSE and the coefficient of determination R² to evaluate the prediction effects of the different prediction models, and selecting an optimal model from the prediction models according to the prediction effects.

9. A computer-readable storage medium storing instructions which, when executed on a computer, cause the computer to perform the method of ecological factor model-based tobacco leaf quality prediction according to any one of claims 1 to 7.

10. Application of the ecological factor model-based tobacco leaf quality prediction method according to any one of claims 1 to 7 to the detection, evaluation and prediction of tobacco leaf production quality, such as the economic traits, disease rate, appearance quality, physical characteristics, chemical components, aroma substances and sensory evaluation of tobacco leaves in different planting areas and different years.