CN116539553A - Method for improving robustness of near infrared spectrum model - Google Patents
- Publication number
- CN116539553A (application CN202310519468.0A)
- Authority
- CN
- China
- Prior art keywords
- spectrum
- near infrared
- data
- infrared spectrum
- characteristic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N21/00—Investigating or analysing materials by the use of optical means, i.e. using sub-millimetre waves, infrared, visible or ultraviolet light
- G01N21/17—Systems in which incident light is modified in accordance with the properties of the material investigated
- G01N21/25—Colour; Spectral properties, i.e. comparison of effect of material on the light at two or more different wavelengths or wavelength bands
- G01N21/31—Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry
- G01N21/35—Investigating relative effect of material at wavelengths characteristic of specific elements or molecules, e.g. atomic absorption spectrometry using infrared light
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/30—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/34—Smoothing or thinning of the pattern; Morphological operations; Skeletonisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/58—Extraction of image or video features relating to hyperspectral data
Abstract
The invention discloses a method for improving the robustness of a near-infrared spectrum model, comprising the following steps. Step 1: remove noise from the spectrum by first-order derivative, improving the signal-to-noise ratio of the spectrum and the separation of overlapping peaks. Step 2: apply multiplicative scatter correction (MSC) to eliminate spectral differences caused by different scattering levels, strengthen the correlation between spectra, and correct baseline shift and offset in the spectral data. Step 3: apply automatic scaling to remove the dimension (units) of the spectra and make the data comparable. Step 4: select feature variables from the processed spectral data with a random frog-leaping algorithm. Step 5: build the near-infrared model from the selected feature variables. The method selects fewer near-infrared spectral feature variables without degrading modeling accuracy and effectively enhances the robustness of the near-infrared model.
Description
Technical Field
The invention relates to a method for improving the robustness of a near-infrared spectrum model. It belongs to the field of data science and in particular concerns selecting fewer spectral feature signals to build a more robust near-infrared spectrum model.
Background
Near-infrared technology is widely used because it is fast, low-cost and precise. However, a near-infrared spectrum records the spectral information of a sample over a fairly wide wavelength range, and only some of those wavelengths carry information about the property to be measured. If modeling uses only the wavelengths that express, or sufficiently express, that information, the robustness of the near-infrared spectrum model can be greatly improved.
If the near-infrared spectral feature variables used for modeling are selected improperly, the robustness of the near-infrared spectrum model suffers. For example, if too many feature variables are selected, the model is affected by noisy spectral features that do not characterize the property to be measured, and it over-fits; if too few are selected, spectral factors that influence the property are left out, and the model under-fits.
The invention is proposed for this purpose.
Disclosure of Invention
The invention aims to provide a variable screening method that improves the robustness of a near-infrared spectrum model. The method is called SSAS (Savitzky-Golay + Multiple Scatter Correction + auto scaling + Shuffled frog leaping algorithm). It uses first-order derivative to remove noise from the spectrum, improving the signal-to-noise ratio and the separation of overlapping peaks; it uses multiplicative scatter correction (MSC) to eliminate spectral differences caused by different scattering levels, strengthen the correlation between spectra, and correct baseline shift and offset in the spectral data; it uses automatic scaling to remove the dimension (units) of the spectra and make the data comparable; and finally it selects the near-infrared modeling variables with a random frog-leaping algorithm. The invention selects fewer near-infrared spectral feature variables without degrading modeling accuracy and effectively enhances the robustness of the near-infrared model.
The technical scheme adopted by the invention is as follows:
A method for improving the robustness of a near-infrared spectrum model: the near-infrared spectra are preprocessed by first-order derivative, multiplicative scatter correction (MSC) and automatic scaling, after which near-infrared spectral features are selected by the random frog-leaping method and used to build the near-infrared spectrum model. The method specifically comprises the following steps:
Step 1: remove noise from the spectrum by first-order derivative, improving the signal-to-noise ratio of the spectrum and the separation of overlapping peaks;
Step 2: apply multiplicative scatter correction (MSC) to eliminate spectral differences caused by different scattering levels, strengthen the correlation between spectra, and correct baseline shift and offset in the spectral data;
Step 3: apply automatic scaling to remove the dimension (units) of the spectra and make the data comparable;
Step 4: select feature variables from the processed spectral data with a random frog-leaping algorithm;
Step 5: build the near-infrared model from the selected feature variables.
Preferably, the specific method of step 1 is as follows: the standard normal variate transformation is used to remove the influence of near-infrared diffuse reflection on the data, and a first-order derivative method is used to smooth and filter the near-infrared spectral data and reduce noise interference. The first-order derivative method adopted is an improvement on the moving-window smoothing algorithm, in which the matrix operator is solved as follows. Set the filter window length to $n = 2m + 1$, with the measurement points in the window at $x = (-m, -m+1, …, -1, 0, 1, …, m-1, m)$, and fit the $n$ data points in the window with a polynomial of order $k-1$ ($k < n$):

$$f(x) = a_0 + a_1 x + a_2 x^2 + … + a_{k-1} x^{k-1}$$

The $n$ points in the window give a system of $n$ linear equations in $k$ unknowns. The polynomial parameters $A = \{a_0, a_1, …, a_{k-1}\}$ are determined by least-squares fitting, and the spectral data are processed with the fitted polynomial to remove noise interference.
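The window fit described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `savgol_first_derivative`, the defaults `m=5` and `k=3`, unit spacing between wavelength points, and edge padding at the spectrum ends are all hypothetical choices, not taken from the patent.

```python
import numpy as np

def savgol_first_derivative(spectrum, m=5, k=3):
    """First derivative from a moving-window least-squares polynomial fit.

    Window length n = 2m + 1, polynomial order k - 1 (k < n).
    Assumes unit spacing between points (illustrative assumption).
    """
    n = 2 * m + 1
    if not 2 <= k < n:
        raise ValueError("need 2 <= k < n = 2m + 1")
    x = np.arange(-m, m + 1, dtype=float)
    # Design matrix with columns x^0, x^1, ..., x^(k-1): n equations, k unknowns.
    design = np.vander(x, k, increasing=True)
    # The pseudo-inverse maps the n window values to the least-squares a_0..a_{k-1}.
    coeff_from_window = np.linalg.pinv(design)
    # f'(0) = a_1, so row 1 holds the weights of the derivative filter.
    deriv_weights = coeff_from_window[1]
    y = np.asarray(spectrum, dtype=float)
    padded = np.pad(y, m, mode="edge")  # edge handling is an assumption
    # Slide the window: out[j] = sum_t deriv_weights[t] * padded[j + t].
    return np.correlate(padded, deriv_weights, mode="valid")
```

For a quadratic input and `k=3` the fit is exact away from the padded ends, so the interior of the output equals the analytic derivative. Library routines such as SciPy's `savgol_filter` offer the same filter with more edge-handling options.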
Preferably, the specific method of step 2 is as follows:
(1) For each wavelength point of the spectral data used for modeling, compute the mean value to construct an "ideal spectrum":

$$\overline{Spec}_j = \frac{1}{n} \sum_{i=1}^{n} Spec_{ij}$$

where $\overline{Spec}_j$ is the $j$-th ($j \in \{1,2,…,m\}$) feature value of the "ideal spectrum", $m$ is the number of feature values of a near-infrared spectrum, $n$ is the number of near-infrared spectra used for modeling, and $Spec_{ij}$ is the $j$-th ($j \in \{1,2,…,m\}$) feature value of the $i$-th ($i \in \{1,2,…,n\}$) spectrum $Spec_i$;
(2) Perform a univariate linear regression of each modeling spectrum $Spec_i$ ($i \in \{1,2,…,n\}$) on the "ideal spectrum" $\overline{Spec}$:

$$Spec_i = k_i \, \overline{Spec} + b_i$$

where $k_i$ and $b_i$ are, respectively, the baseline shift amount and the offset amount of the $i$-th ($i \in \{1,2,…,n\}$) spectrum $Spec_i$ relative to the "ideal spectrum" $\overline{Spec}$, obtained from the univariate linear regression;
(3) Using the baseline shift and offset obtained in step (2), correct each modeling spectrum $Spec_i$ ($i \in \{1,2,…,n\}$) as follows:

$$Spec_{i(MSC)} = \frac{Spec_i - b_i}{k_i}$$

where $Spec_{i(MSC)}$ is the spectrum obtained from $Spec_i$ ($i \in \{1,2,…,n\}$) after multiplicative scatter correction.
The purpose of applying scatter correction to the near-infrared spectra used to build the model is to correct baseline shift and offset in the spectral data and to eliminate differences between spectra caused by differing experimental conditions and similar factors.
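The three sub-steps of the scatter correction (mean "ideal spectrum", per-spectrum linear regression, correction) can be sketched as follows. The function name `msc` and the optional `reference` argument, which lets validation spectra be corrected against the training-set mean, are illustrative assumptions rather than details given in the patent.

```python
import numpy as np

def msc(spectra, reference=None):
    """Scatter-correct each row of `spectra` (n spectra x m wavelength points)."""
    spectra = np.asarray(spectra, dtype=float)
    # (1) "Ideal spectrum": mean over all spectra at each wavelength point.
    if reference is None:
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, spec in enumerate(spectra):
        # (2) Univariate regression Spec_i = k_i * reference + b_i.
        k_i, b_i = np.polyfit(reference, spec, 1)
        # (3) Remove the fitted offset and slope.
        corrected[i] = (spec - b_i) / k_i
    return corrected, reference
```

When every spectrum is an exact scaled and shifted copy of one underlying curve, all corrected rows collapse onto the reference, which is a quick sanity check for the implementation.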
Preferably, the specific method of step 3 is as follows: the automatic scaling method is applied as

$$x'_i = \frac{x_i - \bar{x}}{\sigma_x}$$

where $x_i$ is the absorbance at the $i$-th wavenumber of the near-infrared spectrum to be processed, $n$ is the number of feature variables of the near-infrared spectrum, $x'_i \in [0,1]$ ($i \in \{1,2,…,n\}$) is dimensionless, $\bar{x}$ is the mean absorbance of the near-infrared spectrum data, and $\sigma_x$ is the standard deviation of the absorbance of the near-infrared spectrum data. The resulting $\{x'_i\}$ ($i \in \{1,2,…,n\}$) is the preprocessed near-infrared spectrum data.
The invention uses automatic scaling to remove the dimension (units) of the spectra and so make the spectra more comparable with one another.
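A minimal sketch of mean-and-standard-deviation scaling, matching the quantities named in this step. Two points are assumptions: the statistics are taken over the absorbances of the single spectrum passed in, as the wording suggests, and the sample standard deviation is used. The function name `autoscale` is hypothetical.

```python
import statistics

def autoscale(absorbances):
    """Centre a sequence of absorbances on its mean and divide by its standard deviation."""
    mean = statistics.mean(absorbances)
    std = statistics.stdev(absorbances)  # sample standard deviation (assumption)
    if std == 0:
        raise ValueError("constant spectrum cannot be scaled")
    return [(x - mean) / std for x in absorbances]
```

Values scaled this way have zero mean and unit standard deviation, so they are centred on zero rather than confined to [0, 1]; a max-min rule, (x - min) / (max - min), would be needed to obtain that range.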
Preferably, the specific method of step 4 is as follows: randomly generate an initial set of near-infrared spectral variables containing $Q \in \{1,2,…,n\}$ variables, denoted $V_0$, where $n$ is the length of the near-infrared spectrum. Let the current iteration be $i \in \{0,1,2,…\}$, the number of spectral feature variables in this iteration be $Q_i$, and the set of near-infrared spectral feature variables be $V_i$; then iterate as follows:
(a) Draw a random number $rand_i$ from the probability distribution $N(Q_i, \theta \times Q_i)$ and set $Q_{i+1} = [rand_i]$ (the value rounded to an integer), where $\theta$ is a positive real number in the range $[0,1]$. The distribution $N(Q_i, \theta \times Q_i)$ ensures that the larger the current number of feature variables $Q_i$, the more likely $Q_{i+1}$ is to differ substantially from $Q_i$; conversely, the smaller $Q_i$, the less likely a large difference is;
(b) If $Q_{i+1} = Q_i$, then $V_{i+1} = V_i$. If $Q_{i+1} < Q_i$, build a PLS model on the feature variable set $V_i$, sort the absolute values of the feature variable coefficients of the PLS model from large to small, and take the first $Q_{i+1}$ feature variables as the set $V_{i+1}$. If $Q_{i+1} > Q_i$, randomly select $w(Q_{i+1} - Q_i)$ feature variables from the set $V - V_i$, denoted $W_i$, where $V$ is the set of all spectral features and $w > 1$; when $w(Q_{i+1} - Q_i) > n - Q_i$, take $W_i = V - V_i$. Build a PLS model on the feature variable set $V_i \cup W_i$, sort the absolute values of the feature variable coefficients of the PLS model from large to small, and take the first $Q_{i+1}$ feature variables as the set $V_{i+1}$;
(c) Repeat the above steps for $k$ iterations to obtain $k+1$ spectral feature sets $V_A = \{V_0, V_1, V_2, …, V_k\}$. For each spectral feature $v_i$ ($i \in \{1,2,…,n\}$), compute its frequency of occurrence in $V_A$, denoted $p_i$, and select the features with $p_i \ge p$ ($p \in [0,1]$) as the set of characteristic spectra finally used for near-infrared modeling.
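The iteration in (a) to (c) can be sketched as below. Several details are assumptions made only for illustration: $\theta \times Q_i$ is treated as the standard deviation of the normal draw, $Q_{i+1}$ is clamped to the range 1 to $n$, the PLS model is a small single-response NIPALS fit with two components, and all names and defaults (`pls1_coefficients`, `random_frog`, `q0`, `n_iter`, `seed`) are hypothetical.

```python
import numpy as np

def pls1_coefficients(X, y, n_components=2):
    """Regression coefficients of a single-response PLS model (NIPALS, centred data)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n_comp = min(n_components, X.shape[1], X.shape[0] - 1)
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = X.T @ y
        norm = np.linalg.norm(w)
        if norm < 1e-12:          # nothing left to explain
            break
        w = w / norm
        t = X @ w                 # scores
        tt = t @ t
        p = X.T @ t / tt          # X loadings
        c = (y @ t) / tt          # y loading
        X = X - np.outer(t, p)    # deflate X and y
        y = y - c * t
        W.append(w); P.append(p); q.append(c)
    if not W:
        return np.zeros(X.shape[1])
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def random_frog(X, y, q0=10, n_iter=200, theta=0.3, w=3, p=0.5, seed=0):
    """Return (indices with selection frequency >= p, frequency of every variable)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    all_vars = np.arange(n)
    current = rng.choice(n, size=q0, replace=False)      # V_0
    counts = np.zeros(n)
    counts[current] += 1
    for _ in range(n_iter):
        q_i = len(current)
        # (a) next subset size; theta * Q_i used as standard deviation (assumption)
        q_next = int(round(rng.normal(q_i, theta * q_i)))
        q_next = min(max(q_next, 1), n)
        # (b) build the candidate pool, then keep the largest |coefficients|
        pool = None
        if q_next < q_i:
            pool = current
        elif q_next > q_i:
            outside = np.setdiff1d(all_vars, current)    # V - V_i
            n_extra = min(int(np.ceil(w * (q_next - q_i))), len(outside))
            extra = rng.choice(outside, size=n_extra, replace=False)  # W_i
            pool = np.concatenate([current, extra])
        if pool is not None:
            coef = np.abs(pls1_coefficients(X[:, pool], y))
            current = pool[np.argsort(coef)[::-1][:q_next]]
        counts[current] += 1
    # (c) frequency of each variable over the k + 1 recorded sets
    freq = counts / (n_iter + 1)
    return np.flatnonzero(freq >= p), freq
```

Ranking by raw coefficient magnitude only makes sense when the columns of `X` are on a common scale, which is what the preceding automatic scaling step is meant to provide.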
The beneficial effects of the invention are as follows:
Compared with the prior art, the method for improving the robustness of the near-infrared spectrum model selects fewer near-infrared spectral feature variables for modeling without degrading modeling accuracy, and effectively enhances the robustness of the near-infrared spectrum model.
Drawings
FIG. 1 is a schematic diagram of the steps of the method for improving the robustness of a near infrared spectrum model according to the present invention.
FIG. 2 is a plot of raw data of near infrared spectra used for model construction in the examples.
FIG. 3 is a plot of raw data of near infrared spectrum for model verification in an example.
FIG. 4 is a plot of the near-infrared spectral data used for model construction in the example, after noise removal by first-order derivative.
FIG. 5 is a plot of the near-infrared spectral data used for model construction in the example, after noise removal by first-order derivative and removal of scattering-level differences by multiplicative scatter correction.
FIG. 6 is a plot of the near-infrared spectral data used for model construction in the example, after noise removal by first-order derivative, removal of scattering-level differences by multiplicative scatter correction, and removal of the spectral dimension by automatic scaling.
FIG. 7 shows the distribution of the number of spectral feature variables selected by the different near-infrared spectral variable selection schemes in the example.
FIG. 8 is a plot of the near-infrared spectral data used for model verification in the example, after noise removal by first-order derivative.
FIG. 9 is a plot of the near-infrared spectral data used for model verification in the example, after noise removal by first-order derivative and removal of scattering-level differences by multiplicative scatter correction.
FIG. 10 is a plot of the near-infrared spectral data used for model verification in the example, after noise removal by first-order derivative, removal of scattering-level differences by multiplicative scatter correction, and removal of the spectral dimension by automatic scaling.
FIG. 11 shows the distribution of $Q^2$ for the models built with the different near-infrared spectral variable selection schemes.
Detailed Description
The invention is described in further detail below with reference to the attached drawings and specific examples:
Examples
Data: near-infrared spectral data of 655 flue-cured tobacco samples from different regions, plant positions and varieties were used. Using the Kennard-Stone algorithm based on PCA scores (PCA-Score KS), 524 samples were selected uniformly according to the PCA score matrix as the model training set, as shown in FIG. 2; the remaining 131 samples were used as the model test set, as shown in FIG. 3.
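A Kennard-Stone split on PCA scores can be sketched as follows. The patent does not state how many principal components were used or how distances were measured, so the three-component default, the Euclidean distance and the function names `pca_scores` and `kennard_stone` are assumptions for illustration only.

```python
import numpy as np

def pca_scores(X, n_components=3):
    """Scores of the first principal components, computed from the SVD of centred data."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def kennard_stone(points, n_select):
    """Pick n_select rows of `points` that cover the space uniformly; return (selected, remaining)."""
    points = np.asarray(points, dtype=float)
    if not 2 <= n_select <= len(points):
        raise ValueError("n_select must be between 2 and the number of points")
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Start from the two points that are farthest apart.
    first, second = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(first), int(second)]
    remaining = [i for i in range(len(points)) if i not in selected]
    while len(selected) < n_select:
        # Distance from each remaining point to its nearest already-selected point.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining
```

With the 655 spectra as rows of a matrix `spectra`, a call such as `kennard_stone(pca_scores(spectra), 524)` would return 524 training indices and 131 test indices under these assumptions.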
The 524 selected samples, used as the model training set, were processed with steps 1 to 5 exactly as set out in the Disclosure of Invention above, including the specific methods given there for steps 1 to 4.
The 524 modeling spectra after noise removal by first-order derivative are shown in FIG. 4. The 524 modeling spectra after noise removal by first-order derivative and removal of scattering-level differences by multiplicative scatter correction are shown in FIG. 5. The 524 modeling spectra after noise removal by first-order derivative, removal of scattering-level differences by multiplicative scatter correction, and removal of the spectral dimension by automatic scaling are shown in FIG. 6.
After this preprocessing (first-order derivative, multiplicative scatter correction and automatic scaling), the spectral feature screening schemes shown in Table 1 were used to select different sets of spectral feature values.
Table 1. Different near-infrared spectral feature screening schemes
For the 524 training-set spectra, after all near-infrared spectral data had been preprocessed by first-order derivative, multiplicative scatter correction and automatic scaling, the 8 schemes shown in Table 1 were used to screen the spectral features; the results are shown in Table 2 and FIG. 7. The number of near-infrared spectral features retained by the SSAS screening method provided by the invention is clearly lower than the number retained by the other methods.
Total sugar, reducing sugar, total nitrogen, potassium, chlorine and nicotine models were then built from the spectral features screened by each of the 8 schemes.
Table 2. Number of spectral feature variables screened with the different schemes
Scheme | Total sugar | Reducing sugar | Total nitrogen | Potassium | Chlorine | Nicotine | Average
N-Selection | 1557 | 1557 | 1557 | 1557 | 1557 | 1557 | 1557
UVE | 581 | 704 | 785 | 579 | 574 | 634 | 643
MC-UVE | 481 | 493 | 742 | 527 | 565 | 439 | 541
SSAS | 213 | 116 | 244 | 456 | 558 | 44 | 272
Range | 678 | 481 | 754 | 1493 | 845 | 754 | 834
Range-UVE | 882 | 899 | 1003 | 1525 | 996 | 985 | 1048
Range-MC-UVE | 823 | 774 | 980 | 1527 | 968 | 893 | 994
Range-SSAS | 781 | 601 | 858 | 1531 | 1060 | 781 | 935
The near-infrared spectral features were screened with the schemes whose results are listed in Table 2, and the total sugar, reducing sugar, total nitrogen, potassium, chlorine and nicotine models were built from the screened features.
The 131 test-set spectra were preprocessed in the same way, by first-order derivative, multiplicative scatter correction and automatic scaling. The 131 verification spectra after noise removal by first-order derivative are shown in FIG. 8. The 131 verification spectra after noise removal by first-order derivative and removal of scattering-level differences by multiplicative scatter correction are shown in FIG. 9. The 131 verification spectra after noise removal by first-order derivative, removal of scattering-level differences by multiplicative scatter correction, and removal of the spectral dimension by automatic scaling are shown in FIG. 10. Finally, the total sugar, reducing sugar, total nitrogen, potassium, chlorine and nicotine models were evaluated on the 131 test-set spectra; the $Q^2$ statistics of all models are shown in Table 3 and FIG. 11.
Table 3. Evaluation of all near-infrared models ($Q^2$)
Scheme | Total sugar | Reducing sugar | Total nitrogen | Potassium | Chlorine | Nicotine | Average
N-Selection | 0.956 | 0.925 | 0.79 | 0.931 | 0.953 | 0.81 | 0.894
UVE | 0.963 | 0.926 | 0.788 | 0.945 | 0.958 | 0.821 | 0.900
MC-UVE | 0.964 | 0.927 | 0.793 | 0.953 | 0.955 | 0.821 | 0.902
SSAS | 0.961 | 0.930 | 0.781 | 0.959 | 0.944 | 0.829 | 0.901
Range | 0.965 | 0.916 | 0.814 | 0.921 | 0.96 | 0.81 | 0.898
Range-UVE | 0.963 | 0.928 | 0.792 | 0.931 | 0.954 | 0.826 | 0.899
Range-MC-UVE | 0.964 | 0.923 | 0.800 | 0.931 | 0.952 | 0.822 | 0.899
Range-SSAS | 0.964 | 0.929 | 0.789 | 0.932 | 0.942 | 0.820 | 0.896
Statistics (CV) | 0.31% | 0.43% | 1.26% | 1.39% | 0.63% | 0.85% | --
The model evaluation index $Q^2$ in Table 3 is defined as follows:

$$Q^2 = 1 - \frac{\sum_{i=1}^{n} (pre_i - act_i)^2}{\sum_{i=1}^{n} (act_i - \overline{act})^2}$$

where $n$ is the number of samples used for model verification ($n = 131$ in this embodiment), $pre_i$ is the model prediction for the $i$-th sample, $act_i$ is the actual value of the $i$-th sample, and $\overline{act}$ is the mean of the actual values of all samples.
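As a small illustration of this evaluation index, the sketch below computes one minus the ratio of the squared prediction errors to the squared deviations of the actual values from their mean, which is the standard form of $Q^2$ for an external test set. The function name `q_squared` is hypothetical.

```python
def q_squared(predicted, actual):
    """Q^2 = 1 - sum((pre_i - act_i)^2) / sum((act_i - mean(act))^2)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    mean_actual = sum(actual) / len(actual)
    # Sum of squared prediction errors on the verification samples.
    press = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    # Total sum of squares of the actual values around their mean.
    total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - press / total
```

A perfect model gives a value of 1, and a model that always predicts the mean of the actual values gives 0, so values close to 1 indicate good predictive ability.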
The statistic (CV) in Table 3 is the coefficient of variation, defined by the following formula:
$$CV = \frac{\sigma}{\mu} \times 100\%$$
where σ is the standard deviation and μ is the mean; the CV value expresses how much the data fluctuate relative to their mean.
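A minimal sketch of the coefficient of variation, applied to the total sugar column of Table 3. The source does not say whether the sample or the population standard deviation is used; the sample form (`statistics.stdev`) is an assumption here, so the result may differ slightly from the tabulated value. The function name is hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Return CV = sigma / mu, expressed as a percentage."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (assumption)
    return 100.0 * sigma / mu


# Q2 values of the total sugar models, read from Table 3.
total_sugar_q2 = [0.956, 0.963, 0.964, 0.961, 0.965, 0.963, 0.964, 0.964]
cv_total_sugar = coefficient_of_variation(total_sugar_q2)
print(f"{cv_total_sugar:.2f}%")
```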
As Table 3 shows, after near infrared feature screening with the different schemes, the evaluation values of the models built for total sugar, reducing sugar, total nitrogen, potassium, chlorine and nicotine in the tobacco leaf samples differ very little (CV ≤ 3%); for some of these prediction models, the evaluation values obtained with the different spectral feature variable screening schemes differ by less than 1%. This indicates that the method of the invention for improving the robustness of the near infrared spectrum model (SSAS) does not degrade the quality of the near infrared spectrum model.
Table 2 lists the number of spectral feature variables selected in each model by the different spectral feature variable screening schemes shown in Table 1. As Table 2 shows, the method of the invention (SSAS) selects markedly fewer spectral feature variables than the other schemes without affecting the modeling result.
In summary, the method of the invention (SSAS) does not affect the quality of the near infrared spectrum model, and because fewer spectral feature variables are selected, the resulting near infrared spectrum model is more robust.
The examples describe only preferred embodiments of the present invention and are not intended to limit it; various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.
Claims (5)
1. A method for improving the robustness of a near infrared spectrum model, comprising the following steps:
step 1: removing noise from the spectrum by first-order derivation, thereby improving the signal-to-noise ratio of the spectrum and enhancing the resolution of overlapping peaks;
step 2: eliminating spectral differences caused by different scattering levels by multivariate scattering correction, thereby enhancing the correlation between spectra and correcting baseline translation and offset in the spectral data;
step 3: eliminating the spectral dimension and enhancing data comparability by an automatic scaling method;
step 4: selecting characteristic variables from the processed spectral data by a random frog-leaping algorithm;
step 5: constructing a near infrared model from the selected characteristic variables.
2. The method according to claim 1, wherein the specific method of step 1 is as follows: a standard normal variate transformation is used to eliminate the influence of near infrared diffuse reflection on the near infrared data, and a first-order derivation method is used to smooth and filter the near infrared spectrum data, reducing interference from noise; the first-order derivation method is an improvement on the moving-window smoothing algorithm, in which the matrix operator is solved as follows: the filter window length is set to n = 2m + 1 and the measuring points in the window are x = (-m, -m+1, …, -1, 0, 1, …, m-1, m); the n data points in the window are fitted with a polynomial of order k-1 (k < n), $f(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_{k-1} x^{k-1}$; the n points in the window give a system of n linear equations in k unknowns, the polynomial parameters $A = \{a_0, a_1, \dots, a_{k-1}\}$ are determined by least-squares fitting, and the polynomial is used to process the spectral data so as to eliminate its noise interference.
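The windowed least-squares derivative described in claim 2 can be sketched as follows. This is a simplified illustration, not the claimed implementation: it assumes unit spacing between measuring points and a fitted polynomial of order 1 or 2, in which case the least-squares coefficient a_1 at the window center reduces to Σ x·y / Σ x² over x = -m, …, m (the odd term is orthogonal to the even terms on a symmetric window). Dropping the m edge points on each side is also an assumption, as is the function name.

```python
def first_derivative(spectrum, m=2):
    """First derivative from a least-squares polynomial fit (order 1 or 2)
    in a moving window of length n = 2m + 1, evaluated at the window center."""
    n_points = len(spectrum)
    if n_points < 2 * m + 1:
        raise ValueError("spectrum is shorter than the filter window")
    offsets = range(-m, m + 1)
    norm = sum(x * x for x in offsets)  # sum of x^2 over the window
    derivative = []
    # Edge points without a full window are skipped (assumption).
    for center in range(m, n_points - m):
        a1 = sum(x * spectrum[center + x] for x in offsets) / norm
        derivative.append(a1)
    return derivative


# A quadratic signal y = t^2 has derivative 2t at every interior point.
signal = [t * t for t in range(7)]
print(first_derivative(signal, m=2))
```

Higher polynomial orders need the full least-squares solution of the n-by-k system described in the claim; the closed form above only covers the low-order case.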
3. The method according to claim 1, wherein the specific method of step 2 is as follows:
(1) For each wavelength point of the spectral data used for modeling, the mean value is computed to construct an "ideal spectrum", according to the formula: $\overline{Spec}_j = \frac{1}{n}\sum_{i=1}^{n} Spec_{ij}$, where $\overline{Spec}_j$ is the j-th (j ∈ {1,2,…,m}) feature value of the "ideal spectrum", m is the number of feature values of the near infrared spectrum, and n is the number of near infrared spectra used for modeling; $Spec_{ij}$ is the j-th (j ∈ {1,2,…,m}) feature value of the i-th (i ∈ {1,2,…,n}) spectrum $Spec_i$;
(2) Each spectrum used for modeling, $Spec_i$ (i ∈ {1,2,…,n}), is regressed on the "ideal spectrum" $\overline{Spec}$ by univariate linear regression, giving the following relation between each spectrum $Spec_i$ and the "ideal spectrum" $\overline{Spec}$: $Spec_i = k_i \overline{Spec} + b_i$, where $k_i$ and $b_i$ are, respectively, the baseline translation amount and the offset amount obtained from the univariate linear regression of the i-th (i ∈ {1,2,…,n}) spectrum $Spec_i$ on the "ideal spectrum" $\overline{Spec}$;
(3) Based on the baseline translation amount and offset amount obtained in step (2), each spectrum used for modeling, $Spec_i$ (i ∈ {1,2,…,n}), is corrected as follows: $Spec_{i(MSC)} = \frac{Spec_i - b_i}{k_i}$, where $Spec_{i(MSC)}$ is the spectrum obtained from the near infrared spectrum $Spec_i$ (i ∈ {1,2,…,n}) after multivariate scattering correction.
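The three sub-steps of claim 3 can be sketched as below, assuming the usual form of the correction: mean spectrum as the "ideal spectrum", an ordinary least-squares fit Spec_i = k_i · mean + b_i, and the corrected spectrum (Spec_i − b_i) / k_i. The function name `msc` and the toy data are hypothetical.

```python
def msc(spectra):
    """Scatter-correct each spectrum against the mean ('ideal') spectrum."""
    n = len(spectra)      # number of spectra used for modeling
    m = len(spectra[0])   # number of feature values per spectrum
    # (1) Ideal spectrum: mean over all spectra at each wavelength point.
    ideal = [sum(s[j] for s in spectra) / n for j in range(m)]
    ideal_mean = sum(ideal) / m
    sxx = sum((v - ideal_mean) ** 2 for v in ideal)
    corrected = []
    for spec in spectra:
        # (2) Univariate linear regression of the spectrum on the ideal spectrum.
        spec_mean = sum(spec) / m
        sxy = sum((v - ideal_mean) * (y - spec_mean) for v, y in zip(ideal, spec))
        k = sxy / sxx
        b = spec_mean - k * ideal_mean
        # (3) Remove the fitted offset and slope.
        corrected.append([(y - b) / k for y in spec])
    return corrected


# Two toy spectra that are scaled and shifted copies of the same shape.
shape = [1.0, 2.0, 4.0, 3.0]
toy = [[0.5 * v + 0.1 for v in shape], [1.5 * v - 0.1 for v in shape]]
print(msc(toy))
```

In this toy case both corrected spectra collapse onto the common shape, which is the intended effect of removing scattering-related slope and offset differences.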
4. The method according to claim 1, wherein the specific method of step 3 is as follows: the automatic scaling method is applied according to the formula: $x'_i = \frac{x_i - \bar{x}}{s}$, where $x_i$ is the absorbance at the i-th wavenumber of the near infrared spectrum to be processed, n is the number of characteristic variables of the near infrared spectrum, $x'_i \in [0,1]$ (i ∈ {1,2,…,n}) is dimensionless, $\bar{x}$ is the mean absorbance of the near infrared spectrum data, and $s$ is the standard deviation of the absorbance of the near infrared spectrum data; the resulting $\{x'_i\}$ (i ∈ {1,2,…,n}) is the preprocessed near infrared spectrum data.
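A minimal sketch of the scaling in claim 4, assuming it takes the form (x_i − mean) / standard deviation computed over the absorbances of one spectrum. The use of the sample standard deviation and the function name `autoscale` are assumptions.

```python
import statistics


def autoscale(spectrum):
    """Center one spectrum on its mean absorbance and divide by its
    absorbance standard deviation, giving dimensionless values."""
    mean_abs = statistics.mean(spectrum)
    std_abs = statistics.stdev(spectrum)  # sample standard deviation (assumption)
    return [(x - mean_abs) / std_abs for x in spectrum]


# Made-up absorbances, for illustration only.
scaled = autoscale([0.2, 0.4, 0.6, 0.8])
print([round(v, 3) for v in scaled])
```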
5. The method according to claim 1, wherein the specific method of step 4 is as follows: an initial set of near infrared spectrum variables containing Q ∈ {1,2,…,n} variables is randomly generated and denoted $V_0$, where n is the length of the near infrared spectrum; with the current iteration number i ∈ {0,1,2,…}, the number of spectral feature variables in this iteration denoted $Q_i$ and the set of near infrared spectrum characteristic variables denoted $V_i$, iteration proceeds according to the following steps:
(a) A random number $rand_i$ is generated from the probability distribution $N(Q_i, \theta \times Q_i)$ and $Q_{i+1} = [rand_i]$ is recorded, where θ is a positive real number in the range [0,1]; $N(Q_i, \theta \times Q_i)$ ensures that the larger the number of selected characteristic variables $Q_i$, the more likely $Q_{i+1}$ is to differ substantially from $Q_i$, and conversely, the smaller $Q_i$, the less likely a large difference becomes;
(b) If $Q_{i+1} = Q_i$, then $V_{i+1} = V_i$; if $Q_{i+1} < Q_i$, a PLS model is constructed from the characteristic variable set $V_i$ of the spectral data, the absolute values of the characteristic variable coefficients in the PLS model are sorted from largest to smallest, and the first $Q_{i+1}$ characteristic variables form the characteristic variable set $V_{i+1}$; if $Q_{i+1} > Q_i$, $w(Q_{i+1} - Q_i)$ characteristic variables are selected from the set $V - V_i$ and denoted $W_i$, where V is the set of all spectral features and w > 1, and when $w(Q_{i+1} - Q_i) > n - Q_i$, $W_i = V - V_i$; a PLS model is constructed from the characteristic variable set $V_i + W_i$ of the spectral data, the absolute values of the characteristic variable coefficients in the PLS model are sorted from largest to smallest, and the first $Q_{i+1}$ characteristic variables form the characteristic variable set $V_{i+1}$;
(c) The above steps are repeated for k cycles, giving k+1 spectral feature sets $V_A = \{V_0, V_1, V_2, \dots, V_k\}$; the frequency with which each spectral feature $v_i$ (i ∈ {1,2,…,n}) occurs in $V_A$ is calculated and denoted $p_i$, and the features with $p_i \ge p$ (p ∈ [0,1]) are selected as the characteristic spectrum set finally used for near infrared modeling.
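The iteration in steps (a) to (c) can be sketched as follows. This is an illustration under several assumptions, not the claimed implementation: the PLS model is replaced by a caller-supplied `rank_fn` that returns one importance score per candidate variable (in the claim these would be absolute PLS coefficients); θ·Q_i is treated as the standard deviation of the normal distribution; [rand_i] is taken as rounding, limited to 1..n; extra candidates are drawn at random from V − V_i; and all names and default values are hypothetical.

```python
import random
from collections import Counter


def random_frog(n, q0, k, rank_fn, theta=0.3, w=3, p=0.5, seed=0):
    """Return (selected variables, selection frequency per variable)."""
    rng = random.Random(seed)
    all_vars = set(range(n))
    current = set(rng.sample(range(n), q0))  # V_0
    history = [set(current)]
    for _ in range(k):
        q_i = len(current)
        # (a) Draw Q_{i+1} around Q_i; the spread grows with Q_i.
        q_next = int(round(rng.gauss(q_i, theta * q_i)))
        q_next = max(1, min(n, q_next))
        if q_next != q_i:
            candidates = sorted(current)
            if q_next > q_i:
                # (b) Add up to w * (Q_{i+1} - Q_i) variables from V - V_i.
                pool = sorted(all_vars - current)
                extra = min(len(pool), w * (q_next - q_i))
                candidates += rng.sample(pool, extra)
            # Rank candidates by absolute score and keep the top Q_{i+1}.
            scores = rank_fn(candidates)
            ranked = sorted(zip(candidates, scores), key=lambda t: -abs(t[1]))
            current = {v for v, _ in ranked[:q_next]}
        history.append(set(current))
    # (c) Frequency of each variable over the k + 1 sets V_0 .. V_k.
    counts = Counter(v for s in history for v in s)
    freq = {v: counts[v] / len(history) for v in range(n)}
    selected = sorted(v for v in range(n) if freq[v] >= p)
    return selected, freq


# Stand-in scoring function: lower-index variables count as more important.
selected, freq = random_frog(n=30, q0=8, k=50,
                             rank_fn=lambda vs: [1.0 / (v + 1) for v in vs])
print(selected)
```

Passing the ranking step in as a function keeps the sketch independent of any particular PLS library; a real run would fit a PLS regression on the candidate columns inside `rank_fn` and return its coefficients.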
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310519468.0A CN116539553A (en) | 2023-05-10 | 2023-05-10 | Method for improving robustness of near infrared spectrum model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116539553A (en) | 2023-08-04 |
Family
ID=87448447
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117269109A (en) * | 2023-11-23 | 2023-12-22 | 中国矿业大学(北京) | Method for detecting chloride ion content in concrete structure based on near infrared spectrum |
CN117269109B (en) * | 2023-11-23 | 2024-02-23 | 中国矿业大学(北京) | Method for detecting chloride ion content in concrete structure based on near infrared spectrum |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||