Disclosure of Invention
The present invention aims to provide a method for processing overlapping peaks for chromatographic analysis, so as to solve the problems in the background art.
In order to achieve the purpose, the invention provides the following technical scheme:
a method for processing overlapped peaks applied to chromatographic analysis specifically comprises the following steps:
s1, finding indexes of all peak values and inflection points by using a difference method: recording all adjacent non-repeated effective values, solving a derivative by using a first-order difference method, judging the sign of the derivative, and obtaining a peak point when the sign of the derivative is changed from positive to negative; the point where the sign of the derivative changes is the inflection point;
s2, only keeping indexes of peak values meeting the minimum peak height: giving the minimum height MinPeakHeight of the peak to be preserved, and deleting the peak which does not meet the height;
s3, only keeping indexes of peak values meeting the threshold value: giving a minimum Threshold Threshold of a peak to be reserved, wherein the Threshold is used for limiting the minimum height of a peak point from the left and right side points of the peak point, and deleting the peak with the minimum height smaller than the Threshold to avoid reserving a flat-top peak;
s4, determining the boundary index of each peak value: calculating a left base point, a left saddle point, a right base point and a right saddle point, wherein the maximum y value of the left base point and the right base point is the base point of the peak, taking the left base point as an example, scanning all inflection points, and searching the left base point of the peak leftwards in sequence when the peak stops: marking the current peak-valley point as valley, the corresponding peak-valley point as peak, marking the left-scanned peak-valley point as vi, and the corresponding peak-valley point as pi;
1) If vi > valley and pi > peak, recording valley as a left base point of the current peak; repeating the steps S1 to S4 to continuously search a left base point of the next peak;
2) If vi is greater than valley and pi is less than or equal to peak, repeating the steps S1-S4, and continuously searching the left base point forwards;
3) If vi < valley and pi > peak, recording valley as a left base point of the current peak; repeating the steps S1 to S4 to continuously search a left base point of the next peak;
4) If vi is less than valley and pi is less than or equal to peak, updating vi as the left base point of the current peak; repeating the steps S1 to S4 to continuously search for smaller peak-to-valley points until a point with a peak value larger than peak is encountered,
therefore, the left base points of all the peaks can be found, the left base points of the reserved peaks are stored, when the peak heights of the two peaks are the same, the saddle point is the nearest peak inflection point, and the base point is the minimum peak valley point; in other cases, the saddle point is the same as the base point, and the right base point is similar to the left base point, but the right base point is searched from the right to the left in the same way;
s5, only keeping indexes meeting the minimum protrusion: given the minimum protrusion minpeakpominence, the peaks smaller than the minimum protrusion are deleted, protrusion: the difference between the y coordinate of the peak point and the y coordinate of the highest base point is determined by the inherent height of the peak and the protrusion degree determined by the position relative to other peaks;
s6, solving the x coordinate of the half-height-width boundary of each peak: determine approximate height of full width at half maximum refHeight: half of the sum of the y coordinate of the peak point and the y coordinate of the base point, the left boundary of the full width at half maximum is found: searching the peak point to the left base point of the peak along the curve until finding a point with the y value closest to and greater than refHeight, recording the x coordinate of the point, and interpolating by utilizing the point and a point adjacent to the right side of the point and the refHeight point to obtain the x coordinate of a left boundary with half-height width; find the right boundary of full width at half maximum: searching the peak point to the right base point of the peak along the curve until finding a point with the y value closest to and greater than refHeight, recording the x coordinate of the point, and interpolating by utilizing the point and a point close to the left side of the point and the refHeight point to obtain the x coordinate of a right boundary with the half-height width;
s7, retaining only peaks within a given width range: giving a minimum peak searching width minW and a maximum peak searching width maxW, if no peak exists or the minimum width is 0 and the maximum width is infinite, not screening, calculating the full width at half maximum by using an x coordinate of a full width at half maximum boundary, and deleting peak values which are not in a given width range;
s8, finding out an index of the maximum peak value in the specified distance: sorting the peaks from large to small, judging from the larger peak to ensure that one small peak is not accidentally reserved and one large peak is removed nearby, if a certain peak is not nearby the larger peak, finding a secondary peak within a set distance range of the peak to eliminate, and performing the following loop operation for each effective peak:
marking the peak value in the range of the effective peak value index minD as 1, otherwise marking 0, and circulating next time, wherein the peak value marked with 1 is not scanned again, searching around the marked 0, and keeping the peak value index marked with 0;
after all the effective peaks needing to be reserved are determined, the effective peaks correspond to the original peak value index sequence again;
s9, sequencing the peak values, and limiting the number of the peaks: inputting the number numP of peaks to be detected, sequencing the detected peaks from large to small, and taking out the first numP peaks as a result;
s10, returning: and returning the coordinate information of the peak height, the peak position, the peak width, the projection and the left and right base points corresponding to the index value.
Preferably, when the flow rate and temperature settings are changed, there is a possibility of baseline fluctuations or drift, and it is assumed herein that the gas chromatograph is gradually brought close to a constant temperature and the components are in a steady state and the baseline is relatively flat, after a sufficiently long time has elapsed since the last change of operating conditions.
Preferably, the base point on the base line is the actual boundary of the peak, and the base point of the base line above a certain threshold is the intersection point of the overlapping peaks; the peak with the cross points is definitely an overlapped peak, and the number of the cross points plus 1 is the number of sub-peaks in the overlapped peak; the nearest non-intersection points on the left and right of the intersection point are the left and right boundary points of the overlapping peak; thus, the overlapping peak range can be determined by utilizing the left and right boundary points, so as to carry out fitting;
the algorithm comprises the following steps:
a) Performing de-duplication ascending operation on the base point obtained from the previous part by taking the x coordinate as a reference to obtain an ordered base point coordinate of the signal data;
b) According to the actual data condition, a threshold value threshold _ y in the y-axis direction is given, points higher than the threshold value are cross points, and points lower than the threshold value are boundary points;
c) Sequentially scanning the ordered base points, recording the previous boundary point as the left boundary point i _ left of the overlapped peak when the ordered base points meet the cross point, continuously scanning, recording the next boundary point as the right boundary point i _ right of the overlapped peak, and recording the number of the cross points plus one as the number i _ nump of the sub-peaks in the overlapped peak; continuing the above steps, and storing the left and right boundary points and the number of sub-peaks of all the overlapped peaks;
d) Calculating a fitting center and a fitting range window;
window=baseX(i_right)-baseX(i_left)
in the formula, baseX (i _ left) and baseX (i _ right) are respectively the x coordinates of the left base point and the right base point of the overlapping peak;
e) Determining an initial value start of a parameter to be fitted, namely the peak position and the peak width of each peak in the overlapped peaks, and calculating the initial value start from the last part;
f) Fit to each overlapping peak:
inputting the chromatographic signal data y, a fitting center, a fitting window, the number numP of peaks to be fitted and an initial value start parameter of a parameter to be fitted into an N-M simplex type iterative fitting model for fitting, obtaining a fitting result according to an evaluation standard, and updating the peak height and the peak width.
Preferably, the algorithm step of the N-M simplex iterative fitting model is:
firstly, constructing a function for calculating the mean square error of a model and an original signal, if the fitting error is larger than the required fitting precision, systematically changing parameters by a program, circulating to the previous step and repeating the previous step until the required fitting precision is reached or the maximum number is reached or iteration is carried out;
i) Constructing a parameter estimation function
Constructing a mean square error function of a calculation model and an original signal:
the following description is given of y model The calculation process of (2):
first of all a gaussian matrix a is determined,
where h is the number of peaks, n is the number of data, g (x, λ) i ) Is a Gaussian function;
the relation between the peak height matrix H, the Gaussian matrix A and the signal data matrix Y is H = abs (A \ Y) T ) And the result H is an approximation, then there is y model =A*H;
II) setting of the termination conditions
Setting the termination error threshold of the parameter lambda to be fitted to be 0.0000001 if lambda i -λ i-1 More than or equal to 0.0000001, replacing the estimation value of the unknown parameter according to the simplex iteration process of III), and carrying out the next iteration until lambda i -λ i-1 If the value is less than 0.0000001, terminating the iteration, otherwise, stopping the iteration for 1000 times;
the error calculation formula is as follows:
referred to as mean fit error, i.e., minimum fit error;
III) iterative procedure
Using the n-dimensionThe algorithm first sets an initial estimate x for a simplex consisting of n +1 points of the quantity x 0 For each part x 0 (i) Increase by 5% to the corresponding x 0 In, will divide x 0 The n-dimensional vectors outside are taken as the initial simplex, which the algorithm iterates over and over according to the following steps:
(1) Let x (i) denote the current simplex point data list, i =1, …, n +1;
(2) Sorting the simple type vertexes from small to large according to function values, wherein the shapes are as follows: f (x (1)) < … < f (x (n + 1)), at each step of the iteration, the algorithm discards the current worst pastry x (n + 1), receives another point as a simplex point;
(3) Generating a reflection point: r =2m-x (n + 1), wherein,
calculating f (r);
(4) If f (x (1)) ≦ f (r) < f (x (n)), accepting r, terminating the iteration, called a reflection;
(5) If f (r) < f (x (1)), the expansion point s is calculated, s = m +2 (m-x (n + 1)), and f(s):
a. if f(s) < f (r), accepting s, terminating the iteration, called expansion;
b. otherwise, accepting r, terminating iteration and reflect;
(6) If f (r) ≧ f (x (n)), compression processing is performed between m and the better of x (n + 1) and r:
a. if f (r) < f (x (n + 1)), (e.g., r is better than x (n + 1)), calculate c = m + (r-m)/2, and calculate f (c); if f (c) < f (r), accept c, terminate the iteration, called the extract output; otherwise, executing the step (7);
b. if f (r) ≧ f (x (n + 1)), calculating cc = m + (x (n + 1) -m)/2, and calculating f (cc), if f (cc) < f (x (n + 1)), accepting cc, terminating iteration, and containing insert, otherwise, proceeding to step (7);
(7) Calculate these n points:
calculating f (v (i)), i =2, …, n +1; the next iteration of the simplex isx (1), v (2), …, v (n + 1), called shrink.
Compared with the prior art, the invention has the beneficial effects that:
1. the invention adopts a data acquisition, baseline noise reduction, calibration fitting and overlapping peak processing module; the method is characterized in that a core module is an overlapping peak processing module, the method comprises the steps of firstly determining the peak position by adopting a derivative method, then determining the overlapping peak by grouping base points, and fitting a peak shape curve by using a simplex method, thereby determining the information such as the peak height, the peak width, the protrusion and the like of each component of the overlapping peak, the method is high in accuracy of determining the peak position, high in accuracy of determining the information of each component of the overlapping peak, can automatically define the peak separating interval, and can automatically determine the number of sub-peaks in the overlapping peak, and has the advantages of accurately determining the peak position and the information of each component of the overlapping peak;
2. the method has the advantages that the concept is simple, differentiation is not needed, the method is not sensitive to the initial value, and each iteration only needs no more than 2 times of function evaluation, so the searching speed is high, and the like.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Referring to fig. 1-5, the present invention provides a chromatogram overlapping peak processing apparatus, which includes a data acquisition module 1, a data analysis module 2, a baseline noise reduction module 3, a calibration fitting module 4, and an overlapping peak processing module 5, wherein the data acquisition module 1 is connected to the data analysis module 2, the data analysis module 2 is connected to the baseline noise reduction module 3, the baseline noise reduction module 3 is connected to the calibration fitting module 4, and the marking fitting module is connected to the overlapping peak processing module 5.
A chromatographic overlapping peak processing method specifically comprises the following steps:
s1, finding indexes of all peak values and inflection points by using a difference method: recording all adjacent non-repeated effective value non-zero values, and utilizing a first-order difference method to solve a derivative. Judging the sign of the derivative, wherein the point where the sign of the derivative changes from positive to negative is a peak point; the point where the sign of the derivative changes is the inflection point; the inflection point includes a peak point
S2, only keeping indexes of peak values meeting the minimum peak height: giving the minimum height MinPeakHeight of the peak to be retained, and deleting the peak which does not meet the height;
s3, only keeping indexes of peak values meeting the threshold value: giving a minimum Threshold Threshold of a peak to be retained, wherein the Threshold is used for limiting the minimum height of a peak value point from the left and right side points of the peak value point, and deleting the peak with the minimum height smaller than the Threshold to avoid retaining a flat-top peak;
s4, determining the boundary index of each peak value: calculating a left base point, a left saddle point, a right base point and a right saddle point, wherein the maximum y value of the left base point and the right base point is the base point of the peak, taking the left base point as an example, scanning all inflection points, and searching the left base point of the peak leftwards in sequence when the peak stops: marking the current peak-valley point as valley, the corresponding peak-valley point as peak, marking the left-scanned peak-valley point as vi, and the corresponding peak-valley point as pi;
1, if vi > valley and pi > peak, recording valley as a left base point of the current peak; repeating the steps 1-4 to continuously search the left base point of the next peak;
2 if vi is greater than valley and pi is less than or equal to peak, repeating the steps 1-4, and continuously searching the left base point forwards;
3 if vi < valley and pi > peak, recording valley as a left base point of the current peak; repeating the steps 1-4 to continuously search the left base point of the next peak;
4 if vi is less than valley and pi is less than or equal to peak, updating vi as the left base point of the current peak; and repeating the steps 1-4 to continuously search for smaller peak-to-valley points until a point with a peak value larger than peak is met.
Therefore, the left base points of all the peaks can be found, the left base points of the reserved peaks are stored, when the peak heights of the two peaks are the same, the saddle point is the nearest peak inflection point, and the base point is the minimum peak valley point; in other cases, the saddle point is the same as the base point, and the right base point is similar to the left base point, but the right base point is searched from the right to the left in the same way;
s5, only keeping indexes meeting the minimum protrusion: given the minimum protrusion minpeakpromience, peaks smaller than the minimum protrusion are deleted. And (3) protrusion: the difference between the y coordinate of the peak point and the y coordinate of the highest base point is determined by the inherent height of the peak and the protrusion degree determined by the position relative to other peaks;
s6, solving the x coordinate of the half-height-width boundary of each peak: determine approximate height of full width at half maximum refHeight: half of the sum of the y coordinate of the peak point and the y coordinate of the base point. Find the left boundary of full width at half maximum: searching the peak point to the left base point of the peak along the curve until finding a point with the y value closest to and greater than refHeight, recording the x coordinate of the point, and interpolating by utilizing the point and a point adjacent to the right side of the point and the refHeight point to obtain the x coordinate of a left boundary with half-height width; find the right boundary of full width at half maximum: searching the peak point to the right base point of the peak along the curve until finding a point with the y value closest to and greater than refHeight, recording the x coordinate of the point, and interpolating by utilizing the point and a point close to the left side of the point and the refHeight point to obtain the x coordinate of a right boundary with the half-height width;
s7, retaining only peaks within a given width range: giving a minimum peak searching width minW and a maximum peak searching width maxW, if no peak exists, or the minimum width is 0 and the maximum width is infinite, not screening, calculating the half-height width by using an x coordinate of a half-height width boundary, and deleting peak values which are not in a given width range;
s8, finding out the index of the maximum peak value in the specified distance: the peaks are sorted from large to small, starting with the larger peak to ensure that a small peak is not accidentally retained and a large peak is removed in the vicinity. If a certain peak value is not near to a larger peak value, finding a secondary peak value within a set distance range of the peak value for elimination, and performing the following loop operation for each effective peak:
marking the peak value within the set distance range of the effective peak value index minD as 1 to indicate that the peak value is deleted, otherwise marking 0 to indicate that the peak value is not deleted, circulating next time, searching around the marking 0 without scanning the peak value of the marking 1, and keeping the peak value index of the marking 0;
after all the effective peaks needing to be reserved are determined, the effective peaks correspond to the original peak value index sequence again;
s9, sequencing the peak values, and limiting the number of the peaks: inputting the number numP of peaks to be detected, sorting the detected peaks from large to small, and taking out the first numP peaks as the result;
s10, returning: and returning information such as peak height, peak position, peak width, protrusion, coordinates of left and right base points and the like corresponding to the index value.
In the invention, a data acquisition module 1 acquires a voltage signal of an FID gas chromatography detection system, the chromatographic signal is filtered and amplified by hardware, and a signal value is sent to a PC end by an analog-to-digital converter and a singlechip control system;
in the invention, a baseline noise reduction module 3 carries out noise removal processing on collected chromatographic data, adopts SG filter and morphological filter combination to carry out real-time data filtering, and uses SG filter with a filtering window of 19 and morphological filter with structural elements of g1 and g2 to carry out filtering on collected voltage signals;
in the invention, a calibration fitting module 4 calibrates the concentration of a chromatographic analyzer by a standard sample gas sample, stores the calibration result in a configuration file, performs linear conversion by a linear interpolation method according to a calibration table when the concentration of each component is specifically calculated, firstly introduces standard gas with known concentration components as benzene series, calibrates the components and the concentration of each peak by observing a spectrogram curve, calibrates the components and the concentration of each peak two to three times, and fits the relationship between the peak height and the concentration of each substance to obtain a fitting curve
Y = a X + b, where Y is the concentration and X is the peak height. And substituting the peak height into a curve equation to obtain the measured real-time concentration.
Two species were measured separately for two experiments:
benzene: a nominal concentration of 50, a nominal height of 56.593,
calibration concentration of 40, calibration height of 45.274
Calibration relation curve y = 0.8835 x
Toluene: a nominal concentration of 60, a nominal height of 61.516,
calibration concentration of 50, calibration height of 51.261
Calibration relation curve y = 0.9754 x;
in the invention, an overlapped peak processing module 5 separates overlapped peaks of chromatographic data, a chromatographic method works according to the tiny difference of the physical and chemical properties of substances, when the physical and chemical properties of two components are similar, the effluent chromatographic curves are very close on a time axis, so that the obtained chromatographic peaks are overlapped, the module is used for separating the chromatographic peaks so as to obtain the substance types and the concentrations of the components, and a more accurate real signal is obtained after filtering, so that component analysis can be carried out. The component analysis is to analyze the components and concentrations of the substances to be detected based on the peak appearance of the spectrum.
The performance of the overlapping peak separation method is evaluated from three aspects of an intuitive separation effect graph, an evaluation standard of a fitting effect and separation errors under different separation degrees; and finally the performance of the whole device is verified through experiments.
First, the calculation method of each index is explained as follows:
real area:
with the integration concept, the real area is the sum of all y values of the Gaussian peaks, and the x-axis has the unit of 1.
Area error:
area error = fitted area-true area
Note that it is different from the separation error.
Signal-to-noise ratio:
wherein, N is the data number, x is the data vector of the signal with noise, and y is the data vector of the pure signal. Only the signal-to-noise ratio of the fitting segment data is calculated herein.
Relative Standard Deviation (RSD):
where S is the standard deviation (which may also be expressed as SD),
are the corresponding average values.
Correlation coefficient squared (R2):
a. and a variance (SSE) of square products to error, the statistical parameter being calculated as The sum of The squares of The errors of The corresponding points of The fitting data and The original data, the calculation formula being as follows:
the closer the SSE is to 0, the better the model selection and fitting, and the more successful the data prediction. The following MSE and RMSE, since SSE is the same,
the effect is the same.
SST: total sum of squares, the sum of the squares of the differences between the raw data and the mean, is given by the formula:
c. square of correlation coefficient
In essence, the "coefficient of certainty" characterizes how well a fit is by the change in data. From the above expression, it can be known that the normal value range of the "determination coefficient" is [0,1], the closer to 1, the stronger the explanatory ability of the variable of the equation to y is, and the better the model fits the data.
The experimental procedure is described below:
1) Evaluation of fitting Effect
The experiment simulates 10 different overlapping peaks, which consist of Gaussian peaks with a width of 10-50 and a peak position of 70-120, and is added with Gaussian noise with a floating range of 0.003-0.05 and baseline noise with a slope of 0.001. The results of the experiments are shown in the following table:
TABLE 1 comparison table of fitting effect of 10 overlapping peaks
As can be seen from the table above, for the signal data with the average signal-to-noise ratio of 72.64, the fitting error is less than 0.6%, and the correlation coefficient is as high as 0.999 or higher, indicating that the fitting effect is good.
In order to prove the stability of the algorithm, the method also performs the following experiments: and 6 kinds of noises with different signal-to-noise ratios are added to the same analog data (formed by overlapping two Gaussian peaks with the widths of 10 and 20 respectively) to verify the fitting effect. The results of the experiments are shown in the following table:
TABLE 2 comparison table of fitting effect of overlapping peaks of the same sample at different times
From the above table, in the process that the signal-to-noise ratio is reduced from 94.58 to 53.53, the fitting error of the signal data is less than 0.7% (mean square error is 0.18), and the correlation coefficient is as high as more than 0.999 (mean square error is 0.0001), which indicates that the fitting method is stable.
1) Error of separation
The separation degree is used as an index of the separation efficiency of the chromatographic column, the separation condition of two adjacent components in the chromatographic column can be judged, and the larger the separation degree R is, the better the separation of the two adjacent components is. Generally, when the degree of separation R <1, the two peaks partially overlap.
The resolution calculation formula is as follows:
in the formula: the retention times of the 2 single peaks constituting the overlapping peak, respectively, were the peak bottom widths of the 2 single peaks, respectively.
Four kinds of simulation data with the separation degrees of 0.50, 0.67, 0.84 and 1.01 are designed in the experiment, and Gaussian noise and baseline noise with the same floating range are added. The 4 groups of data are tested, and the separation error is calculated by the following formula:
in the formula: s X Is the area of the X peak.
The results of the experiment are reported below:
TABLE 3 separation error under different degrees of separation
As can be seen from the above table, the separation error of the signal data with an average signal-to-noise ratio of 68.11 is within + -1.6.
2) Graph of separation effect
The separation effect can be seen visually from the separation effect chart, three different overlapped peaks are selected for processing in the experiment, and the separation effect is shown in attached figures 3, 4 and 5.
3) Device performance
The device is mainly used for separating overlapped peaks and measuring components and concentrations of all components, so that the measurement performance is the relative standard deviation and repeatability of each measured component.
Because the properties of the benzene series are similar and overlapping peaks are easy to generate, the benzene series is taken as an experimental object in the experiment, and the measured gas is diluted benzene series standard gas. The following table shows the benzene analysis results in 9 experiments:
TABLE 4 test results of benzene series
Measuring content
|
Retention time (min)
|
Peak height (mv)
|
Peak width (ms)
|
Peak area (ms. Mv)
|
1
|
1.6683
|
56.8576
|
2993.7300
|
170216.3028
|
2
|
1.6633
|
58.2470
|
2967.4400
|
172844.4777
|
3
|
1.6633
|
55.6954
|
2964.8200
|
165126.8358
|
4
|
1.6617
|
55.8151
|
2938.5500
|
164015.4621
|
5
|
1.6633
|
56.5659
|
2963.1600
|
167613.8122
|
6
|
1.6750
|
56.9446
|
2937.3200
|
167264.5125
|
7
|
1.6700
|
56.2928
|
2939.9800
|
165499.7061
|
8
|
1.6650
|
56.6291
|
2883.5900
|
163295.1065
|
9
|
1.6667
|
56.6744
|
2927.5800
|
165918.8400
|
Mean value of
|
1.6663
|
56.6358
|
2946.2411
|
166866.1173
|
Standard deviation of
|
0.0040
|
0.7010
|
29.4153
|
2874.2181
|
Relative Standard Deviation (SD)
|
0.2394%
|
1.2377%
|
0.9984%
|
1.7225% |
Generally, the change of the continuous sampling retention time in the measurement of the sample gas with the same concentration needs to meet the requirement of less than +/-1%, and the change of the peak height and the peak area needs to meet the requirement of less than +/-3%. As can be seen from the above table, the relative standard deviations of retention time, peak height and peak area measured after 9 times of data measurement are only 0.24%,1.24% and 1.72%, respectively, so that qualitative repeatability analysis using retention time as a standard meets the requirement of ± 1%, and meanwhile, quantitative repeatability analysis using peak height and peak area as standards meets the requirement of less than ± 3%;
in the present invention, fluctuations or drifts in the baseline may occur when the flow and temperature settings are changed in steps 1 to S10, assuming that the gas chromatograph is gradually brought close to a constant temperature, the components are in a steady state, and the baseline is relatively stable, after a sufficiently long time has elapsed since the operating conditions were last changed;
in the present invention, the base point peak-bottom point on the base line is the actual boundary of the peak, and the base point peak-bottom point of the base line higher than a certain threshold value is the intersection of the overlapping peaks. The peak with the cross points is a certain overlapped peak, and the number of the cross points plus 1 is the number of sub-peaks in the overlapped peak. The nearest non-intersection points around the intersection point are the left and right boundary points of the overlapping peak. Thus, the overlapping peak range can be determined by using the left and right boundary points, and fitting is performed.
The algorithm comprises the following steps:
a) And performing de-duplication ascending operation on the base points obtained by the previous part of derivative peak searching method by taking the x coordinate as a reference to obtain the peak-valley point coordinates of the ordered base points of the signal data.
b) According to the actual data situation, given a threshold value threshold _ y in the y-axis direction, points higher than the threshold value are intersection points, and points lower than the threshold value are boundary points. If the threshold is too small, the peak with better separation may be judged as an overlapping peak, and if the threshold is too large, some overlapping peaks with large separation degree may be ignored, so as to give up the viewing requirement.
c) And sequentially scanning the ordered base points, recording the previous boundary point as the left boundary point i _ left of the overlapped peak when an intersection is encountered, continuing to scan, recording the next boundary point as the right boundary point i _ right of the overlapped peak when the next boundary point is encountered, and recording the number of the intersection points plus one as the number i _ nump of the sub-peaks in the overlapped peak. And continuing to store the left and right boundary points and the number of sub-peaks of all the overlapped peaks.
d) The fitting center and the fitting range window are calculated.
window=baseX(i_right)-baseX(i_left)
In the formula, baseX (i _ left) and baseX (i _ right) are x coordinates of left and right base points of the overlapping peak, respectively.
e) The initial value start of the parameter to be fitted is determined. I.e. the peak position and width of each peak in the overlapping peaks, are calculated by the derivative peak-finding method of the previous part.
f) Fit to each overlapping peak:
inputting parameters such as chromatographic signal data y, a fitting center, a fitting window, the number numP of peaks to be fitted, an initial value start of a parameter to be fitted and the like into an N-M simplex type iterative fitting model for fitting, obtaining a fitting result according to an evaluation standard, and updating the peak height and the peak width;
in the invention, the Nelder-Mead simplex algorithm is a direct search method for optimizing a multidimensional unconstrained problem.
The basic idea is to use the search starting point x in the m-dimensional parameter space 0 And constructing a polyhedron with m +1 linear independent vertexes, determining the next search direction by comparing objective function values of the vertexes, performing heuristic operations of reflection, expansion, contraction and compression side length on the polygon, and replacing the worst point with a better new vertex to form a new polyhedron. And continuously adjusting the parameter values in such a way, and finally approaching the optimal solution of the target function.
The algorithm steps of the N-M simplex iterative fitting model are as follows:
first, a function is constructed that calculates the mean square error of the model and the original signal, and if the fitting error is greater than the desired fitting accuracy, the program systematically changes the parameters and loops to the previous step and repeats until the desired fitting accuracy is achieved or the maximum number or iteration is achieved.
And I, constructing a parameter estimation function.
Constructing a mean square error function of a calculation model and an original signal:
the following description is given of model The calculation process of (2):
first of all a gaussian matrix a is determined,
where h is the number of peaks, n is the number of data, g (x, λ) i ) Is a gaussian function.
The relation among the peak height matrix H, the Gaussian matrix A and the signal data matrix Y is H = abs (A \ Y) T ) And the result H is an approximation, then there is y model =A*H。
II setting of end conditions
Setting the termination error threshold of the parameter lambda to be fitted to be 0.0000001 if lambda i -λ i-1 If the absolute value of the parameter is more than or equal to 0.0000001, replacing the estimated value of the unknown parameter according to the III simplex iteration process, and carrying out the next iteration until lambda is reached i -λ i-1 <0.0000001, terminating the iteration, otherwise 1000 iterations stop.
The error calculation formula is as follows:
referred to as mean fit error, meanFitError, MFE, i.e., minimum fit error.
III iterative procedure
Using a simplex consisting of n +1 points of an n-dimensional vector x, the algorithm first sets an initial estimate x 0 For each part x 0 (i) Increase by 5% to the corresponding x 0 In, will divide x 0 The n-dimensional vector outside as the initial simplex if x 0 (i) =0, then 0.00025 is used as component i, and the algorithm iterates through the simplex iteratively according to the following steps:
(1) Let x (i) denote the current simplex point data list, i =1, …, n +1.
(2) Sorting the simple type vertexes from small to large according to function values, wherein the shapes are as follows: f (x (1)) < … <
f (x (n + 1)), at each step of the iteration the algorithm discards the current worst pastry x (n + 1), receives another point as a simplex point or, in step 7 below, it changes all n points greater than f (x (1)).
(3) Generating a reflection point: r =2m-x (n + 1), wherein,
f (r) is calculated.
(4) If f (x (1)) ≦ f (r) < f (x (n)), accepting r, and terminating the iteration. Known as reflection
(5) If f (r) < f (x (1)), the expansion point s, s = m +2 (m-x (n +) is calculated
1) And calculating f(s):
a. if f(s) < f (r), accept s, terminate the iteration, called extended
b. Otherwise, accept r, terminate iteration, reflect.
(6) If f (r) ≧ f (x (n)), compression processing is performed between m and the better of x (n + 1) and r:
a. if f (r) < f (x (n + 1)), (e.g., r is better than x (n + 1)), c = m + (r-m)/2 is calculated, and f (c) is calculated. If f (c) < f (r), c is accepted, terminating the iteration, called the extract output. Otherwise, step 7 shrink is performed.
b. If f (r) ≧ f (x (n + 1)), calculate cc = m + (x (n + 1) -m)/2, and calculate f (cc), if f (cc) < f (x (n + 1)), accept cc, terminate the iteration, track insert, otherwise, proceed to step 7.
Calculate these n points:
calculate f (v (i)), i =2, …,
n +1. The next iteration of simplex is x (1), v (2), …, v (n + 1), called shrink.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It should be understood by those skilled in the art that the present invention is not limited to the above embodiments, and the above embodiments and descriptions are only preferred examples of the present invention and are not intended to limit the present invention, and that various changes and modifications may be made without departing from the spirit and scope of the present invention, which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and equivalents thereof.