CN116539553A - Method for improving robustness of near infrared spectrum model

Info

Publication number: CN116539553A
Application number: CN202310519468.0A
Authority: CN
Inventors: 张翼鹏; 颜克亮; 凌军; 朱保昆; 陈微; 张伟; 曾仲大; 文里梁
Original assignee: China Tobacco Yunnan Industrial Co Ltd
Current assignee: China Tobacco Yunnan Industrial Co Ltd
Priority date: 2023-05-10
Filing date: 2023-05-10
Publication date: 2023-08-04

Abstract

The invention discloses a method for improving the robustness of a near infrared spectrum model, comprising the following steps: step 1: applying a first-order derivative to remove noise from the spectrum, improve its signal-to-noise ratio and sharpen the separation of overlapping peaks; step 2: applying multiplicative scatter correction (MSC) to eliminate spectral differences caused by different scattering levels, enhance the correlation between spectra, and correct baseline shift and offset in the spectral data; step 3: applying autoscaling to remove the dimension of the spectra and make the data comparable; step 4: selecting characteristic variables from the processed spectral data with a random frog-leaping algorithm; step 5: constructing a near infrared model from the selected characteristic variables. The method selects fewer near infrared spectral characteristic variables without affecting modeling accuracy, and effectively enhances the robustness of the near infrared model.

Description

Method for improving robustness of near infrared spectrum model

Technical Field

The invention discloses a method for improving the robustness of a near infrared spectrum model, belongs to the field of data science, and particularly relates to a method for selecting fewer spectrum characteristic signals to construct a more robust near infrared spectrum model.

Background

Near infrared technology is widely used because it is fast, low-cost and precise. However, a near infrared spectrum records sample information over a fairly wide wavelength range, while only some of those wavelengths carry the characteristic information to be measured. The robustness of a near infrared spectrum model can therefore be greatly improved if the model is built only from the wavelengths that express, or sufficiently express, the characteristic information of the sample under test.

If the near infrared spectral feature variables used for modeling are selected improperly, the robustness of the near infrared spectrum model suffers. For example, if too many characteristic variables are selected, the model is affected by noisy spectral features that do not characterize the information to be detected, so that the model is under-fitted; if too few characteristic variables are selected, the model is over-fitted because some influencing spectral features are not taken into account.

The invention is proposed for this purpose.

Disclosure of Invention

The invention aims to provide a variable-screening method for improving the robustness of a near infrared spectrum model. The method is referred to as SSAS (Savitzky-Golay + Multiple Scatter Correction + auto scaling + Shuffled frog leaping algorithm). It applies a first-order derivative to remove noise from the spectrum, improve its signal-to-noise ratio and sharpen the separation of overlapping peaks; applies multiplicative scatter correction (MSC) to eliminate spectral differences caused by different scattering levels, enhance the correlation between spectra, and correct baseline shift and offset in the spectral data; applies autoscaling to remove the dimension of the spectra and make the data comparable; and finally selects the near infrared modeling variables with a random frog-leaping algorithm. The invention selects fewer near infrared spectral characteristic variables without affecting modeling accuracy, and can effectively enhance the robustness of the near infrared model.

The invention adopts the technical scheme that:

a method for improving the robustness of a near infrared spectrum model, comprising the steps of: after the near infrared spectrum is properly preprocessed by first-order derivative, multiplicative scatter correction and max-min normalization, near infrared spectral features are selected by a random frog-leaping method to construct a near infrared spectrum model; the method specifically comprises the following steps:

step 1: applying a first-order derivative to remove noise from the spectrum, improve its signal-to-noise ratio and sharpen the separation of overlapping peaks;

step 2: applying multiplicative scatter correction (MSC) to eliminate spectral differences caused by different scattering levels, enhance the correlation between spectra, and correct baseline shift and offset in the spectral data;

step 3: applying autoscaling to remove the dimension of the spectra and make the data comparable;

step 4: selecting characteristic variables from the processed spectral data with a random frog-leaping algorithm;

step 5: constructing a near infrared model from the selected characteristic variables.

Preferably, the specific method of step 1 is as follows: standard normal variate transformation is used to eliminate the influence of near infrared diffuse reflection on the data, and a first-order derivative method is used to smooth-filter the near infrared spectral data and reduce the interference of noise. The first-order derivative method adopted is an improvement on the moving smoothing algorithm, in which the matrix operator is solved as follows: set the filter window length $n = 2m + 1$ with measurement points $x = (-m, -m+1, \dots, -1, 0, 1, \dots, m-1, m)$ in the window, and fit the n data points with the polynomial of degree $k-1$ ($k < n$) given by

$$f(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_{k-1} x^{k-1}$$

The n points in the window yield a system of n linear equations in k unknowns; the polynomial parameters $A = \{a_0, a_1, \dots, a_{k-1}\}$ are determined by least-squares fitting, and the spectral data are processed with this polynomial to eliminate noise interference.
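
The windowed least-squares fit described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the patented implementation: the function name `savgol_first_derivative`, the default values of `m` and `k`, the `step` argument and the edge padding are assumptions that do not appear in the text.

```python
import numpy as np

def savgol_first_derivative(spectrum, m=5, k=3, step=1.0):
    """First derivative from a local least-squares polynomial fit.

    Window length n = 2m + 1, polynomial degree k - 1 (k coefficients).
    Defaults and edge handling are illustrative assumptions.
    """
    spectrum = np.asarray(spectrum, dtype=float)
    n = 2 * m + 1
    if not (1 < k < n):
        raise ValueError("need 1 < k < n = 2m + 1")
    x = np.arange(-m, m + 1)
    # Design matrix: n equations (rows) in k unknowns a_0 .. a_{k-1}.
    design = np.vander(x, k, increasing=True)
    # Row j of the pseudo-inverse maps the window values to the
    # least-squares estimate of a_j.
    coeff_operator = np.linalg.pinv(design)
    # f'(0) = a_1, so row 1 is the first-derivative filter.
    deriv_weights = coeff_operator[1] / step
    # Assumption: repeat the edge values so the output keeps the input length.
    padded = np.pad(spectrum, m, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, n)
    return windows @ deriv_weights

# Usage: the derivative of x^2 is recovered exactly away from the edges.
t = np.arange(50, dtype=float)
d = savgol_first_derivative(t ** 2, m=5, k=3)
```

Only the row of the operator belonging to $a_1$ is needed, because the derivative of the fitted polynomial at the window centre $x = 0$ equals $a_1$.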

Preferably, the specific method of step 2 is as follows:

(1) For each wavelength point of the spectral data used for modeling, the mean is computed to construct an "ideal spectrum":

$$\overline{Spec}_j = \frac{1}{n} \sum_{i=1}^{n} Spec_{ij}$$

where $\overline{Spec}_j$ is the j-th ($j \in \{1, 2, \dots, m\}$) feature value of the "ideal spectrum", m is the number of feature values of a near infrared spectrum, n is the number of near infrared spectra used for modeling, and $Spec_{ij}$ is the j-th ($j \in \{1, 2, \dots, m\}$) feature value of the i-th ($i \in \{1, 2, \dots, n\}$) spectrum $Spec_i$;

(2) A univariate linear regression of each modeling spectrum $Spec_i$ ($i \in \{1, 2, \dots, n\}$) on the "ideal spectrum" $\overline{Spec}$ is performed, giving

$$Spec_i = k_i \, \overline{Spec} + b_i$$

where $k_i$ and $b_i$ are, respectively, the baseline shift amount and the offset amount obtained from the univariate linear regression of the i-th ($i \in \{1, 2, \dots, n\}$) spectrum $Spec_i$ on the "ideal spectrum" $\overline{Spec}$;

(3) Using the baseline shift amount and offset amount obtained in step (2), each modeling spectrum $Spec_i$ ($i \in \{1, 2, \dots, n\}$) is corrected as follows:

$$Spec_{i(MSC)} = \frac{Spec_i - b_i}{k_i}$$

where $Spec_{i(MSC)}$ is the near infrared spectrum $Spec_i$ ($i \in \{1, 2, \dots, n\}$) after multiplicative scatter correction.

The purpose of applying scatter correction to the near infrared spectra used to construct the model is to correct baseline shift and offset in the spectral data and to eliminate differences between spectra caused by differing experimental conditions and similar factors.
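
The three sub-steps of step 2 (mean spectrum, per-spectrum linear regression, correction) can be illustrated with the short NumPy sketch below. It assumes the usual form of the correction, in which the regression intercept is subtracted and the result is divided by the regression slope; the function name `msc` and the optional `reference` argument are hypothetical.

```python
import numpy as np

def msc(spectra, reference=None):
    """Scatter-correct each row of `spectra` against a reference spectrum.

    spectra:   array of shape (n_spectra, m_wavelengths)
    reference: the "ideal spectrum"; defaults to the mean of `spectra`
    """
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        # Sub-step (1): mean over spectra at every wavelength point.
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, spec in enumerate(spectra):
        # Sub-step (2): spec ~ k_i * reference + b_i (slope first, then intercept).
        k_i, b_i = np.polyfit(reference, spec, deg=1)
        # Sub-step (3): remove the intercept, then rescale by the slope.
        corrected[i] = (spec - b_i) / k_i
    return corrected, reference

# Usage: spectra that differ only by a scale and an offset collapse onto one curve.
base = np.sin(np.linspace(0, 3, 40)) + 2.0
raw = np.array([1.0 * base, 1.5 * base + 0.2, 0.7 * base - 0.1])
corrected, ref = msc(raw)
```

Returning the reference makes it possible to correct verification spectra against the training-set mean; whether the source does so is not stated, so that choice is an assumption of this sketch.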

Preferably, the specific method of step 3 is as follows: autoscaling is applied according to the following formula:

$$x'_i = \frac{x_i - \bar{x}}{s}$$

where $x_i$ is the absorbance at the i-th wavenumber of the near infrared spectrum to be processed, n is the number of feature variables of the near infrared spectrum, $x'_i \in [0, 1]$ ($i \in \{1, 2, \dots, n\}$) is dimensionless, $\bar{x}$ is the mean absorbance of the near infrared spectral data, and $s$ is the standard deviation of the absorbance of the near infrared spectral data; the resulting $\{x'_i\}$ ($i \in \{1, 2, \dots, n\}$) is the preprocessed near infrared spectral data.

The present invention uses an automatic scaling method to eliminate the dimension of the spectra to enhance the comparability between the spectra.
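
A minimal sketch of autoscaling (subtract the mean, divide by the standard deviation) is given below. The function name `autoscale`, the default of scaling each wavelength column across samples (`axis=0`), and the sample standard deviation (`ddof=1`) are assumptions; the text does not say along which axis the mean and standard deviation are taken.

```python
import numpy as np

def autoscale(X, mean=None, std=None, axis=0, ddof=1):
    """Centre to zero mean and scale to unit standard deviation.

    axis=0 scales every wavelength across samples (assumed default);
    axis=1 would scale every spectrum across its wavelengths instead.
    """
    X = np.asarray(X, dtype=float)
    if mean is None:
        mean = X.mean(axis=axis, keepdims=True)
    if std is None:
        std = X.std(axis=axis, ddof=ddof, keepdims=True)
    # Guard against constant variables to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std, mean, std

# Usage: scale a training matrix, then reuse its statistics for new spectra.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.5, scale=0.1, size=(10, 6))
Z_train, mu, sd = autoscale(X_train)
Z_new, _, _ = autoscale(X_train[:2], mean=mu, std=sd)
```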

Preferably, the specific method of step 4 is as follows: randomly generate an initial near infrared spectral variable set containing $Q \in \{1, 2, \dots, n\}$ variables, denoted $V_0$, where n is the length of the near infrared spectrum; let the current iteration number be $i \in \{0, 1, 2, \dots\}$, the number of spectral feature variables in this iteration be $Q_i$, and the corresponding near infrared spectral feature variable set be $V_i$; iterate according to the following steps:

(a) Generate a random number $rand_i$ from the probability distribution $N(Q_i, \theta \times Q_i)$ and set $Q_{i+1} = [rand_i]$, where $\theta$ is a positive real number in the range $[0, 1]$. The distribution $N(Q_i, \theta \times Q_i)$ ensures that the larger the number of selected feature variables $Q_i$, the more likely a large difference between $Q_{i+1}$ and $Q_i$; conversely, the smaller $Q_i$, the less likely a large difference;

(b) If $Q_{i+1} = Q_i$, then $V_{i+1} = V_i$. If $Q_{i+1} < Q_i$, construct a PLS model with the feature variable set $V_i$ of the spectral data, sort the feature variables by the absolute value of their coefficients in the PLS model from large to small, and take the first $Q_{i+1}$ feature variables to form the feature variable set $V_{i+1}$. If $Q_{i+1} > Q_i$, select $w(Q_{i+1} - Q_i)$ feature variables from the set $V - V_i$, denoted $W_i$, where V is the set of all spectral features and $w > 1$, with $W_i = V - V_i$ when $w(Q_{i+1} - Q_i) > n - Q_i$; construct a PLS model with the feature variable set $V_i + W_i$ of the spectral data, sort the feature variables by the absolute value of their coefficients in the PLS model from large to small, and take the first $Q_{i+1}$ feature variables to form the feature variable set $V_{i+1}$;

(c) Repeat the above steps for k cycles to obtain k+1 spectral feature sets $V_A = \{V_0, V_1, V_2, \dots, V_k\}$; calculate the frequency of occurrence of each spectral feature $v_i$ ($i \in \{1, 2, \dots, n\}$) in $V_A$, denoted $p_i$, and select the features with $p_i \ge p$ ($p \in [0, 1]$) as the characteristic spectrum set finally used for near infrared modeling.
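
Steps (a) to (c) can be sketched as follows, following the steps as written above. Several details are assumptions made only so that the sketch runs: the second parameter of the normal distribution is treated as a standard deviation, the bracket is read as rounding, the drawn size is clamped to the range 1 to n, the extra variables are drawn at random, and a small single-response NIPALS routine stands in for the PLS model. The names `pls1_coefficients` and `random_frog`, and all default values, are hypothetical.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """Regression coefficients of a single-response PLS model (NIPALS)."""
    Xk = X - X.mean(axis=0)
    yc = y - y.mean()
    W, P, q = [], [], []
    for _ in range(min(n_components, X.shape[1], X.shape[0] - 1)):
        w = Xk.T @ yc
        norm = np.linalg.norm(w)
        if norm < 1e-12:              # nothing left in X that covaries with y
            break
        w = w / norm
        t = Xk @ w                    # scores
        p = Xk.T @ t / (t @ t)        # X loadings
        q.append(yc @ t / (t @ t))    # y loading
        Xk = Xk - np.outer(t, p)      # deflate X
        W.append(w)
        P.append(p)
    if not W:
        return np.zeros(X.shape[1])
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(q))

def random_frog(X, y, q0=10, iterations=200, theta=0.3, w=3, p=0.5,
                n_components=3, seed=0):
    """Return (indices with selection frequency >= p, all frequencies)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    current = rng.choice(n, size=q0, replace=False)       # V_0
    counts = np.zeros(n)
    counts[current] += 1
    for _ in range(iterations):
        q_i = current.size
        # (a) draw the next subset size; std = theta * Q_i is an assumption.
        q_next = int(round(rng.normal(q_i, theta * q_i)))
        q_next = min(max(q_next, 1), n)                   # assumed clamp
        if q_next != q_i:
            pool = current
            if q_next > q_i:
                # (b) add up to w * (Q_{i+1} - Q_i) variables from V - V_i.
                outside = np.setdiff1d(np.arange(n), current)
                extra = min(int(np.ceil(w * (q_next - q_i))), outside.size)
                added = rng.choice(outside, size=extra, replace=False)
                pool = np.concatenate([current, added])
            # (b) keep the Q_{i+1} variables with the largest |coefficient|.
            coef = np.abs(pls1_coefficients(X[:, pool], y, n_components))
            current = pool[np.argsort(coef)[::-1][:q_next]]
        counts[current] += 1
    # (c) frequency of each variable over the k + 1 sets V_0 .. V_k.
    freq = counts / (iterations + 1)
    return np.flatnonzero(freq >= p), freq

# Usage on synthetic data with two informative variables.
rng = np.random.default_rng(1)
Xs = rng.normal(size=(40, 30))
ys = 2.0 * Xs[:, 3] - 1.5 * Xs[:, 7] + 0.01 * rng.normal(size=40)
selected, freq = random_frog(Xs, ys, iterations=100)
```

In practice a library PLS implementation could replace the NIPALS helper; it is written out here only to keep the sketch self-contained.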

The invention has the beneficial effects that:

compared with the prior art, the method for improving the robustness of the near infrared spectrum model can select fewer near infrared spectrum characteristic variables for modeling on the basis of not affecting modeling accuracy, and can effectively enhance the robustness of the near infrared spectrum model.

Drawings

FIG. 1 is a schematic diagram of the steps of the method for improving the robustness of a near infrared spectrum model according to the present invention.

FIG. 2 is a plot of raw data of near infrared spectra used for model construction in the examples.

FIG. 3 is a plot of raw data of near infrared spectrum for model verification in an example.

FIG. 4 is a plot of the near infrared spectral data used for model construction in the example after noise removal by first-order derivative.

FIG. 5 is a plot of the near infrared spectral data used for model construction in the example after noise removal by first-order derivative and elimination of scattering-level differences by multiplicative scatter correction.

FIG. 6 is a plot of the near infrared spectral data used for model construction in the example after noise removal by first-order derivative, elimination of scattering-level differences by multiplicative scatter correction, and removal of the spectral dimension by autoscaling.

FIG. 7 shows the distribution of the number of spectral feature variables selected by the different near infrared spectrum variable selection schemes in the embodiment.

FIG. 8 is a plot of the near infrared spectral data used for model verification in the embodiment after noise removal by first-order derivative.

FIG. 9 is a plot of the near infrared spectral data used for model verification in the embodiment after noise removal by first-order derivative and elimination of scattering-level differences by multiplicative scatter correction.

FIG. 10 is a plot of the near infrared spectral data used for model verification in the embodiment after noise removal by first-order derivative, elimination of scattering-level differences by multiplicative scatter correction, and removal of the spectral dimension by autoscaling.

FIG. 11 shows the distribution of $Q^2$ for the models constructed according to the different near infrared spectrum variable selection schemes.

Detailed Description

The invention is described in further detail below with reference to the attached drawings and specific examples:

Examples

Data: near infrared spectral data of 655 flue-cured tobacco samples from different regions, parts and varieties were used. Using the Kennard-Stone algorithm based on PCA scores (PCA-Score KS), 524 samples were selected uniformly according to the PCA score matrix as the model training set, as shown in FIG. 2; the remaining 131 samples were used as the model test set, as shown in FIG. 3.
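
The training/test split described above can be sketched as a Kennard-Stone selection run on PCA scores. This is a generic illustration, not the code used in the example: the number of principal components, the Euclidean distance, and the function names `pca_scores` and `kennard_stone` are assumptions.

```python
import numpy as np

def pca_scores(X, n_components):
    """PCA score matrix of mean-centred X, computed via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def kennard_stone(points, n_select):
    """Indices of n_select (>= 2) points chosen by the Kennard-Stone rule."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Start with the two points that are farthest apart.
    first, second = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(first), int(second)]
    remaining = [i for i in range(len(points)) if i not in selected]
    while len(selected) < n_select:
        # Distance from each remaining point to its nearest selected point.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        # Pick the remaining point that is farthest from the selected set.
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected)

# Usage: an 80/20 split of a small synthetic data set.
rng = np.random.default_rng(0)
spectra = rng.normal(size=(20, 50))
train_idx = kennard_stone(pca_scores(spectra, n_components=3), n_select=16)
test_idx = np.setdiff1d(np.arange(len(spectra)), train_idx)
```

For the data set in the example the same call would use `n_select=524`, leaving 131 samples for the test set.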

The 524 selected samples, serving as the model training set, are processed by the following steps:

after the near infrared spectrum is properly preprocessed by first-order derivative, multiplicative scatter correction and max-min normalization, near infrared spectral features are selected by a random frog-leaping method to construct a near infrared spectrum model; the method specifically comprises the following steps:

The specific method of step 1 is as follows: standard normal variate transformation is used to eliminate the influence of near infrared diffuse reflection on the data, and a first-order derivative method is used to smooth-filter the near infrared spectral data and reduce the interference of noise. The first-order derivative method is an improvement on the moving smoothing algorithm, in which the matrix operator is solved as follows: set the filter window length $n = 2m + 1$ with measurement points $x = (-m, -m+1, \dots, -1, 0, 1, \dots, m-1, m)$ in the window, and fit the n data points with the polynomial of degree $k-1$ ($k < n$) given by

$$f(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_{k-1} x^{k-1}$$

The n points in the window yield a system of n linear equations in k unknowns; the polynomial parameters $A = \{a_0, a_1, \dots, a_{k-1}\}$ are determined by least-squares fitting, and the spectral data are processed with this polynomial to eliminate noise interference.

The specific method in the step 2 is as follows:

The specific method of step 3 is as follows: autoscaling is applied according to the following formula:

$$x'_i = \frac{x_i - \bar{x}}{s}$$

where $x_i$ is the absorbance at the i-th wavenumber of the near infrared spectrum to be processed, n is the number of feature variables of the near infrared spectrum, $x'_i \in [0, 1]$ ($i \in \{1, 2, \dots, n\}$) is dimensionless, $\bar{x}$ is the mean absorbance of the near infrared spectral data, and $s$ is the standard deviation of the absorbance of the near infrared spectral data; the resulting $\{x'_i\}$ ($i \in \{1, 2, \dots, n\}$) is the preprocessed near infrared spectral data.

The specific method of step 4 is as follows: randomly generate an initial near infrared spectral variable set containing $Q \in \{1, 2, \dots, n\}$ variables, denoted $V_0$, where n is the length of the near infrared spectrum; let the current iteration number be $i \in \{0, 1, 2, \dots\}$, the number of spectral feature variables in this iteration be $Q_i$, and the corresponding near infrared spectral feature variable set be $V_i$; iterate according to the following steps:

(a) Generate a random number $rand_i$ from the probability distribution $N(Q_i, \theta \times Q_i)$ and set $Q_{i+1} = [rand_i]$, where $\theta$ is a positive real number in the range $[0, 1]$. The distribution $N(Q_i, \theta \times Q_i)$ ensures that the larger the number of selected feature variables $Q_i$, the more likely a large difference between $Q_{i+1}$ and $Q_i$; conversely, the smaller $Q_i$, the less likely a large difference;

(b) If $Q_{i+1} = Q_i$, then $V_{i+1} = V_i$. If $Q_{i+1} < Q_i$, construct a PLS model with the feature variable set $V_i$ of the spectral data, sort the feature variables by the absolute value of their coefficients in the PLS model from large to small, and take the first $Q_{i+1}$ feature variables to form the feature variable set $V_{i+1}$. If $Q_{i+1} > Q_i$, select $w(Q_{i+1} - Q_i)$ feature variables from the set $V - V_i$, denoted $W_i$, where V is the set of all spectral features and $w > 1$, with $W_i = V - V_i$ when $w(Q_{i+1} - Q_i) > n - Q_i$; construct a PLS model with the feature variable set $V_i + W_i$ of the spectral data, sort the feature variables by the absolute value of their coefficients in the PLS model from large to small, and take the first $Q_{i+1}$ feature variables to form the feature variable set $V_{i+1}$;

FIG. 4 shows the 524 modeling spectra after noise removal by first-order derivative. FIG. 5 shows the 524 modeling spectra after noise removal by first-order derivative and elimination of scattering-level differences by multiplicative scatter correction. FIG. 6 shows the 524 modeling spectra after noise removal by first-order derivative, elimination of scattering-level differences by multiplicative scatter correction, and removal of the spectral dimension by autoscaling.

After noise removal by first-order derivative, elimination of scattering-level differences by multiplicative scatter correction, and removal of the spectral dimension by autoscaling, the spectral feature screening schemes shown in Table 1 were applied to select different sets of spectral feature values.

Table 1 different near infrared spectral signature screening schemes

For the 524 training-set spectra used for modeling, after all near infrared spectra had been properly preprocessed by first-order derivative, multiplicative scatter correction and max-min normalization, the 8 schemes shown in Table 1 were used to screen spectral features; the results are shown in Table 2 and FIG. 7. The number of near infrared spectral features screened by the SSAS feature screening method provided by the invention is clearly lower than the number screened by the other methods.

Models for total sugar, reducing sugar, total nitrogen, potassium, chlorine and nicotine were then constructed from the spectral features screened by each of the 8 screening schemes.

Table 2 Number of spectral feature variables screened by the different schemes

Scheme	Total sugar	Reducing sugar	Total nitrogen	Potassium	Chlorine	Nicotine	Average
N-Selection	1557	1557	1557	1557	1557	1557	1557
UVE	581	704	785	579	574	634	643
MC-UVE	481	493	742	527	565	439	541
SSAS	213	116	244	456	558	44	272
Range	678	481	754	1493	845	754	834
Range-UVE	882	899	1003	1525	996	985	1048
Range-MC-UVE	823	774	980	1527	968	893	994
Range-SSAS	781	601	858	1531	1060	781	935

Using the near infrared spectral features screened by the schemes listed in Table 2, models for total sugar, reducing sugar, total nitrogen, potassium, chlorine and nicotine were constructed respectively.

The 131 model test spectra were preprocessed by first-order derivative, multiplicative scatter correction and autoscaling. FIG. 8 shows the 131 model verification spectra after noise removal by first-order derivative. FIG. 9 shows them after noise removal by first-order derivative and elimination of scattering-level differences by multiplicative scatter correction. FIG. 10 shows them after noise removal by first-order derivative, elimination of scattering-level differences by multiplicative scatter correction, and removal of the spectral dimension by autoscaling. Finally, the constructed total sugar, reducing sugar, total nitrogen, potassium, chlorine and nicotine models were evaluated with the 131 test-set spectra; the statistical analysis of the $Q^2$ index of all models is shown in Table 3 and FIG. 11.

Table 3 Evaluation of all near infrared models ($Q^2$)

Scheme	Total sugar	Reducing sugar	Total nitrogen	Potassium	Chlorine	Nicotine	Average
N-Selection	0.956	0.925	0.79	0.931	0.953	0.81	0.894
UVE	0.963	0.926	0.788	0.945	0.958	0.821	0.900
MC-UVE	0.964	0.927	0.793	0.953	0.955	0.821	0.902
SSAS	0.961	0.930	0.781	0.959	0.944	0.829	0.901
Range	0.965	0.916	0.814	0.921	0.96	0.81	0.898
Range-UVE	0.963	0.928	0.792	0.931	0.954	0.826	0.899
Range-MC-UVE	0.964	0.923	0.800	0.931	0.952	0.822	0.899
Range-SSAS	0.964	0.929	0.789	0.932	0.942	0.820	0.896
Statistics (CV)	0.31%	0.43%	1.26%	1.39%	0.63%	0.85%	--

The model evaluation index $Q^2$ in Table 3 is defined by the following formula:

$$Q^2 = 1 - \frac{\sum_{i=1}^{n} (pre_i - act_i)^2}{\sum_{i=1}^{n} (act_i - \overline{act})^2}$$

where n is the number of samples for model verification (n = 131 in this embodiment), $pre_i$ is the model-predicted value of the i-th sample, $act_i$ is the actual value of the i-th sample, and $\overline{act}$ is the mean of the actual values of all samples.
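
As a small illustration of the evaluation index, the sketch below computes $Q^2$ in its usual form, one minus the ratio of the squared prediction error to the total squared deviation of the actual values from their mean. The function name `q_squared` and the sample numbers are hypothetical and are not data from the example.

```python
import numpy as np

def q_squared(actual, predicted):
    """Q^2 = 1 - sum((pre - act)^2) / sum((act - mean(act))^2)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    press = np.sum((predicted - actual) ** 2)     # prediction error sum of squares
    total = np.sum((actual - actual.mean()) ** 2) # total sum of squares
    return 1.0 - press / total

# Usage with made-up values: one of four predictions is off by 1.
score = q_squared([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

A perfect prediction gives 1, while always predicting the mean of the actual values gives 0.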

The statistic (CV) in Table 3 is the coefficient of variation, defined by the following formula:

$$CV = \frac{\sigma}{\mu} \times 100\%$$

where $\sigma$ is the standard deviation, $\mu$ is the mean, and the CV value reflects the magnitude of fluctuation among the data.
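
The coefficient of variation can be computed as in the sketch below. Whether the source uses the sample or the population standard deviation is not stated, so the `ddof` argument and its default are assumptions, as is the function name `coefficient_of_variation`; the input numbers are made up.

```python
import numpy as np

def coefficient_of_variation(values, ddof=1):
    """Standard deviation divided by the mean, in percent."""
    values = np.asarray(values, dtype=float)
    return values.std(ddof=ddof) / values.mean() * 100.0

# Usage with made-up values: population std = 2, mean = 5, so CV = 40 %.
cv = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9], ddof=0)
```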

As shown in Table 3, after near infrared feature screening with the different schemes, the evaluation values of the models constructed for tobacco constituents such as total sugar, reducing sugar, total nitrogen, potassium, chlorine and nicotine differ very little (CV ≤ 3%); for some constituents, the evaluation values obtained with the different spectral feature screening schemes differ by less than 1%. This indicates that the method of the invention for improving the robustness of the near infrared spectrum model (SSAS) does not degrade the quality of the near infrared spectrum model.

Table 2 lists the number of spectral feature variables screened for each model by the spectral feature screening schemes of Table 1. As Table 2 shows, the method of the invention (SSAS) screens markedly fewer spectral feature variables than the other schemes, without affecting the modeling result.

In summary, the method (SSAS) of the present invention does not affect the quality of the near infrared spectrum model, and the near infrared spectrum model is more robust because fewer spectral feature variables are selected.

The examples are given solely for the preferred embodiments of the present invention and are not intended to limit the invention thereto, since various modifications and variations will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for improving the robustness of a near infrared spectrum model, comprising the steps of:

2. The method according to claim 1, wherein the specific method of step 1 is as follows: standard normal variate transformation is used to eliminate the influence of near infrared diffuse reflection on the data, and a first-order derivative method is used to smooth-filter the near infrared spectral data and reduce the interference of noise; the first-order derivative method is an improvement on the moving smoothing algorithm, in which the matrix operator is solved as follows: set the filter window length $n = 2m + 1$ with measurement points $x = (-m, -m+1, \dots, -1, 0, 1, \dots, m-1, m)$ in the window, and fit the n data points with the polynomial of degree $k-1$ ($k < n$) given by

$$f(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_{k-1} x^{k-1}$$

The n points in the window yield a system of n linear equations in k unknowns; the polynomial parameters $A = \{a_0, a_1, \dots, a_{k-1}\}$ are determined by least-squares fitting, and the spectral data are processed with this polynomial to eliminate noise interference.

3. The method according to claim 1, wherein the specific method of step 2 is as follows:

4. The method according to claim 1, wherein the specific method of step 3 is as follows: autoscaling is applied according to the following formula:

$$x'_i = \frac{x_i - \bar{x}}{s}$$

where $x_i$ is the absorbance at the i-th wavenumber of the near infrared spectrum to be processed, n is the number of feature variables of the near infrared spectrum, $x'_i \in [0, 1]$ ($i \in \{1, 2, \dots, n\}$) is dimensionless, $\bar{x}$ is the mean absorbance of the near infrared spectral data, and $s$ is the standard deviation of the absorbance of the near infrared spectral data; the resulting $\{x'_i\}$ ($i \in \{1, 2, \dots, n\}$) is the preprocessed near infrared spectral data.

5. The method according to claim 1, wherein the specific method of step 4 is as follows: randomly generate an initial near infrared spectral variable set containing $Q \in \{1, 2, \dots, n\}$ variables, denoted $V_0$, where n is the length of the near infrared spectrum; let the current iteration number be $i \in \{0, 1, 2, \dots\}$, the number of spectral feature variables in this iteration be $Q_i$, and the corresponding near infrared spectral feature variable set be $V_i$; iterate according to the following steps;