CN110097920A

CN110097920A - Method for imputing missing values in metabolomics data based on neighbor stability

Info

Publication number: CN110097920A
Application number: CN201910284004.XA
Authority: CN
Inventors: 罗霄; 李超; 林晓惠
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2019-04-10
Filing date: 2019-04-10
Publication date: 2019-08-06
Anticipated expiration: 2039-04-10
Also published as: CN110097920B

Abstract

The present invention provides a method for imputing missing values in metabolomics data based on neighbor stability, and belongs to the technical field of metabolomics data analysis. The core of the method is to measure, for a sample with a missing metabolite, the stability of the content of that metabolite across the sample's k nearest neighbor samples, and then to fill each type of missing value with its own strategy based on the stable neighbor samples. The invention achieves a good imputation effect on metabolomics data containing missing values, which is of great significance for subsequent data analysis, metabolic marker selection, and related tasks.

Description

Method for imputing missing values in metabolomics data based on neighbor stability

Technical field

The invention belongs to the technical field of metabolomics data analysis and relates to a method for imputing missing values in metabolomics data based on neighbor stability. The method takes into account the missingness type of each metabolite's missing values, the similarity between samples, and the stability of neighbor samples.

Background art

Metabolomics carries out systematic qualitative and quantitative studies of the molecular metabolites in an organism in order to find metabolites associated with physiological and pathological changes. Methods for qualifying and quantifying metabolites include mass spectrometry and nuclear magnetic resonance spectroscopy. Metabolomics data obtained by mass spectrometry usually contain many missing values. These missing values arise mainly in two ways. First, random errors introduced during data acquisition or instrument operation cause the content of some metabolites in a sample to go undetected; this type of missingness is called missing at random. Second, the content of a metabolite in a sample falls below the detection limit of the mass spectrometer and is therefore not detected; this type of missingness is called missing not at random. For example, the concentration of bile acid metabolites in the human body varies widely, and because of the instrument detection limit, bile acid metabolites may be missing in many samples of the acquired metabolomics data. Conventional data analysis methods, however, apply only to complete data matrices without missing values. Directly deleting the metabolites or samples that contain missing values loses much valuable information. Filling in missing data with a simple and efficient method is therefore an important task in metabolomics data analysis, and it is of great significance for subsequent data analysis, metabolic marker selection, and related tasks.

Some methods for handling missing values in metabolomics data fill the missing values of a metabolite with zero, the minimum content of that metabolite, half of the minimum, or the median. These methods are simple but can strongly distort subsequent data analysis. Imputation based on k nearest neighbors is a commonly used way of handling missing values in metabolomics data. It assumes that the more similar two samples are, the smaller the deviation between their metabolite contents. If the content of metabolite m is missing in sample s, the k-nearest-neighbor imputation algorithm uses a similarity measure to find the k nearest neighbor samples of s (a neighbor in which metabolite m is also missing is replaced by the next nearest sample) and then fills the missing content of m in s with the weighted average of the content of m in those k neighbors. This algorithm handles missing-at-random data in metabolomics data well, but its imputation of missing-not-at-random data is not satisfactory.
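For reference, the single-value fills mentioned in this paragraph can be sketched as follows. This is a minimal illustration only, not part of the invention; the function name `simple_fill` and the strategy labels are hypothetical, and a metabolite column is assumed to be a list of floats with NaN marking missing values.

```python
import math
from statistics import median

def simple_fill(column, strategy="half_min"):
    """Fill every NaN in one metabolite column with a single constant."""
    observed = [v for v in column if not math.isnan(v)]
    if not observed:
        raise ValueError("column has no observed values")
    constants = {
        "zero": 0.0,
        "min": min(observed),
        "half_min": min(observed) / 2,
        "median": median(observed),
    }
    fill = constants[strategy]
    return [fill if math.isnan(v) else v for v in column]

nan = float("nan")
print(simple_fill([4.0, nan, 2.0, 6.0], "half_min"))  # [4.0, 1.0, 2.0, 6.0]
```

Every missing entry of a column receives the same value regardless of which sample it belongs to, which is why such fills can distort downstream analysis.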

The present invention proposes a method for imputing missing values in metabolomics data based on neighbor stability. The method determines the k nearest neighbor samples of a sample with a missing metabolite from the Euclidean distance between samples, evaluates the stability of the neighbor samples, and fills each type of missing value with a corresponding strategy based on the stable neighbor samples.

Summary of the invention

The purpose of the invention is to fill the missing values in metabolomics data. The core of the method is to measure, for a sample with a missing metabolite, the stability of the content of that metabolite across the sample's k nearest neighbor samples, and then to fill each type of missing value with its own strategy based on the stable neighbor samples.

To achieve the above purpose, the invention adopts the following technical solution:

A method for imputing missing values in metabolomics data based on neighbor stability, comprising the following steps:

The metabolites in biological samples are detected by mass spectrometry and their spectral data are obtained. The spectral data are analyzed with preprocessing operations such as peak identification, peak matching, and normalization, and the metabolite contents of the samples are determined, yielding the metabolomics data.

Let n denote the number of samples in the metabolomics data and p the number of metabolites per sample, and let x_i = (x_i1, x_i2, …, x_ip) denote the vector of the contents of the p metabolites in the i-th sample, 1 ≤ i ≤ n. When the content of metabolite m in sample x_i is missing (x_im is a missing value), 1 ≤ m ≤ p, the missing value x_im is filled by the following steps:

(1) Calculate the Euclidean distance d(x_i, x_j) between sample x_i and each other sample x_j (1 ≤ i ≠ j ≤ n) as follows:

$$d(x_i, x_j) = \sqrt{\frac{\sum_{l=1}^{p} o_{il}\, o_{jl}\, (x_{il} - x_{jl})^2}{\sum_{l=1}^{p} o_{il}\, o_{jl}}} \qquad (1)$$

where o_il indicates whether the content of the l-th metabolite (1 ≤ l ≤ p) of sample x_i is missing: o_il = 0 when the content is missing, and o_il = 1 otherwise. The sum Σ_{l=1}^{p} o_il o_jl is the number of metabolites whose content is missing in neither sample x_i nor sample x_j. The smaller the distance d(x_i, x_j), the higher the similarity between x_i and x_j. The k samples most similar to x_i, as determined by the Euclidean distance, form the sample set S_k;
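Step (1) can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the distance is the square root of the mean squared difference over the metabolites observed in both samples, that the data are a list of rows with NaN marking missing values, and that two samples with no co-observed metabolite are treated as infinitely distant (a case the text does not address). The names `masked_distance` and `k_nearest` are hypothetical.

```python
import math

def masked_distance(a, b):
    """Distance over the metabolites observed in both samples (NaN = missing)."""
    shared = [(u, v) for u, v in zip(a, b)
              if not (math.isnan(u) or math.isnan(v))]
    if not shared:
        return math.inf  # assumption: no co-observed metabolite, maximally distant
    return math.sqrt(sum((u - v) ** 2 for u, v in shared) / len(shared))

def k_nearest(data, i, k):
    """Indices of the k samples closest to sample i, nearest first."""
    others = [j for j in range(len(data)) if j != i]
    return sorted(others, key=lambda j: masked_distance(data[i], data[j]))[:k]

nan = float("nan")
toy = [[1.0, nan, 3.0],   # hypothetical data, 4 samples x 3 metabolites
       [2.0, 2.0, 3.0],
       [5.0, 5.0, 5.0],
       [1.0, 0.0, nan]]
print(k_nearest(toy, 0, 2))  # [3, 1]
```

Dividing by the number of co-observed metabolites keeps distances comparable between sample pairs that share different numbers of observed metabolites.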

(2) Determine the missingness type of the metabolite.

Calculate the Pearson correlation coefficients between metabolite m and the other metabolites. The metabolite most strongly correlated with m, denoted aux_m, is taken as the reference metabolite of m. The missingness type of metabolite m is judged from the content distribution of the reference metabolite aux_m as follows:

Let S_miss = {x_j | x_jm is a missing value, 1 ≤ j ≤ n} denote the set of samples in the metabolomics data in which metabolite m is missing, and let S_obs = {x_j | x_jm is not a missing value, 1 ≤ j ≤ n} denote the set of samples in which metabolite m is not missing. Calculate the average content of the reference metabolite aux_m over S_miss and over S_obs, denoted μ_miss and μ_obs respectively. When metabolite m and aux_m are positively correlated and μ_miss < μ_obs, the missingness type of m is missing not at random, and the method proceeds to step (3); otherwise, the missingness type of m is missing at random, and the method proceeds to step (4). When metabolite m and aux_m are negatively correlated and μ_miss > μ_obs, the missingness type of m is missing not at random, and the method proceeds to step (3); otherwise, the missingness type of m is missing at random, and the method proceeds to step (4).
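The decision rule of step (2) can be sketched as follows. The sketch makes several assumptions that the text does not state: "most strongly correlated" is read as largest absolute Pearson coefficient, correlations are computed over samples where both metabolites are observed, and samples in which aux_m is itself missing are skipped when averaging. It also assumes every pair of columns has at least two co-observed samples with nonzero variance. The names `pearson` and `missing_type` and the labels "MNAR" and "MAR" are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation over samples where both metabolites are observed."""
    pairs = [(a, b) for a, b in zip(x, y)
             if not (math.isnan(a) or math.isnan(b))]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_x * var_y)

def missing_type(data, m):
    """Return 'MNAR' or 'MAR' for metabolite column m of a row-wise data matrix."""
    cols = list(zip(*data))
    # Reference metabolite aux_m: largest absolute correlation with m (assumption).
    aux = max((c for c in range(len(cols)) if c != m),
              key=lambda c: abs(pearson(cols[m], cols[c])))
    r = pearson(cols[m], cols[aux])
    # Contents of aux_m over S_miss and S_obs, skipping samples where aux_m is missing.
    ref_miss = [a for a, v in zip(cols[aux], cols[m])
                if math.isnan(v) and not math.isnan(a)]
    ref_obs = [a for a, v in zip(cols[aux], cols[m])
               if not math.isnan(v) and not math.isnan(a)]
    mu_miss = sum(ref_miss) / len(ref_miss)
    mu_obs = sum(ref_obs) / len(ref_obs)
    if (r > 0 and mu_miss < mu_obs) or (r < 0 and mu_miss > mu_obs):
        return "MNAR"  # missing not at random: go to step (3)
    return "MAR"       # missing at random: go to step (4)

nan = float("nan")
toy = [[nan, 1.0, 5.0], [nan, 2.0, 1.0], [3.0, 5.0, 4.0],   # hypothetical data
       [4.0, 6.0, 2.0], [5.0, 6.9, 9.0], [6.0, 8.0, 3.0]]
print(missing_type(toy, 0))  # MNAR
```

In the toy data, metabolite 0 is missing exactly where its positively correlated reference (metabolite 1) is low, which is the pattern expected when values fall below a detection limit.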

(3) Handling of missing-not-at-random values.

When the content of metabolite m is missing in some samples of S_k, those missing contents are temporarily filled with the minimum content of metabolite m over all samples in the metabolomics data. This step addresses the case where the metabolomics data contain missing-not-at-random values. Such values arise because the metabolite content is below the instrument detection limit and is therefore not detected, so temporarily filling the missing content of m in neighbor samples with the minimum content of m better matches the nature of missing-not-at-random data.

(4) Handling of missing-at-random values.

When the content of metabolite m is missing in some samples of S_k, those missing contents are temporarily filled with the average content of metabolite m over the samples of S_k in which m is not missing. When the content of metabolite m is missing in all samples of S_k, the missing contents are temporarily filled with the minimum content of metabolite m over all remaining samples in the metabolomics data.
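The temporary fills of steps (3) and (4) can be sketched together as follows. This is a minimal sketch, assuming the content of metabolite m is held as one list over all samples with NaN for missing values and that the missingness type has already been decided; the function name `temp_fill_neighbors` and the labels "MNAR" and "MAR" are hypothetical.

```python
import math

def temp_fill_neighbors(column, neighbor_idx, kind):
    """Return the metabolite-m contents of the neighbors with gaps temporarily filled.

    column       -- content of metabolite m over all samples (NaN = missing)
    neighbor_idx -- indices of the samples in S_k
    kind         -- "MNAR" (step 3) or "MAR" (step 4)
    """
    values = [column[j] for j in neighbor_idx]
    observed_nb = [v for v in values if not math.isnan(v)]
    observed_all = [v for v in column if not math.isnan(v)]
    if kind == "MNAR" or not observed_nb:
        # Step (3), or step (4) when m is missing in every neighbor:
        # use the minimum observed content of m in the whole data set.
        fill = min(observed_all)
    else:
        # Step (4): mean of m over the neighbors in which m is observed.
        fill = sum(observed_nb) / len(observed_nb)
    return [fill if math.isnan(v) else v for v in values]

nan = float("nan")
col = [nan, 2.0, 4.0, nan, 6.0, 1.0]               # hypothetical column
print(temp_fill_neighbors(col, [1, 2, 3], "MAR"))   # [2.0, 4.0, 3.0]
print(temp_fill_neighbors(col, [1, 2, 3], "MNAR"))  # [2.0, 4.0, 1.0]
```

The fills are temporary: they only let every neighbor contribute a value to the later stability check and weighted average, and they do not overwrite the data matrix.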

(5) Determine the stable neighbor samples.

The stable neighbor samples in S_k are determined from the degree of fluctuation of the content of metabolite m across the samples of S_k. Calculate the mean μ and standard deviation σ of the content of metabolite m over the samples of S_k. Any sample whose content of metabolite m lies outside the range [μ-σ, μ+σ] is removed from S_k, which finally yields the stable neighbor sample set S'_k. Because the metabolite contents of neighbor samples deviate only slightly from one another, removing the samples outside [μ-σ, μ+σ] reduces the influence of outliers and so ensures the stability and reliability of the fill value calculation.
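The filter of step (5) can be sketched as follows. The text does not say whether σ is the sample or the population standard deviation; this sketch assumes the sample standard deviation (divisor k - 1), which is what `statistics.stdev` computes, and it requires at least two neighbors. The function name `stable_neighbors` and the toy values are hypothetical.

```python
from statistics import mean, stdev

def stable_neighbors(neighbor_ids, values):
    """Keep the neighbors whose metabolite-m content lies within [mu - sigma, mu + sigma]."""
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation (divisor k - 1)
    return [n for n, v in zip(neighbor_ids, values)
            if mu - sigma <= v <= mu + sigma]

# Hypothetical neighbors "a".."d" with their (temporarily filled) content of m.
print(stable_neighbors(["a", "b", "c", "d"], [1.0, 5.0, 5.0, 9.0]))  # ['b', 'c']
```

Values exactly on the boundary are kept, since only samples outside the interval are removed.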

(6) Calculate the weighted average of the content of metabolite m over the samples of S'_k, and fill the missing content of metabolite m in sample x_i with the value x_im calculated by formula (3). The formulas are as follows:

$$w(x_i, s_j) = \frac{1 / d(x_i, s_j)}{\sum_{l=1}^{k'} 1 / d(x_i, s_l)} \qquad (2)$$

$$x_{im} = \sum_{l=1}^{k'} w(x_i, s_l)\, s_{lm} \qquad (3)$$

where k' = |S'_k| is the number of samples in the set S'_k; s_j and s_l (1 ≤ j, l ≤ k') are samples in S'_k; w(x_i, s_j) is the weight given to the content of metabolite m of sample s_j when calculating x_im; d(x_i, s_j) is the Euclidean distance between sample x_i and s_j calculated by formula (1); and s_lm is the content of metabolite m in sample s_l. The content of m in each neighbor sample is given a different weight according to the distance between that neighbor and sample x_i. The smaller the distance between a sample in S'_k and sample x_i, the larger the weight of its content of metabolite m, and the larger its share in the calculation of x_im.
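The weighted fill of step (6) can be sketched as follows. The sketch assumes inverse-distance weights normalized to sum to 1, a form that reproduces the rounded weights listed in the worked example of the specific embodiment; it also assumes all distances are strictly positive. The function name `weighted_fill` is hypothetical.

```python
def weighted_fill(distances, values):
    """Inverse-distance weighted average of the neighbors' content of metabolite m.

    distances -- d(x_i, s_j) for each stable neighbor s_j (all assumed > 0)
    values    -- content of metabolite m in each stable neighbor
    Returns (fill value, list of weights).
    """
    inverse = [1.0 / d for d in distances]
    total = sum(inverse)
    weights = [v / total for v in inverse]
    estimate = sum(w * v for w, v in zip(weights, values))
    return estimate, weights

# Two hypothetical neighbors: the nearer one (d = 1) gets twice the weight.
estimate, weights = weighted_fill([1.0, 2.0], [4.0, 10.0])
print(round(estimate, 2))  # 6.0
```

A neighbor at distance zero would need separate handling, for example by using its value directly; the text does not describe that case.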

Beneficial effects of the present invention:

The invention fills missing data in metabolomics data. It takes the missingness type of each metabolite into account and fills each type of missing value with a different strategy; it also screens the neighbor samples and filters out unstable ones. The invention achieves a good imputation effect on metabolomics data containing missing values, which is of great significance for subsequent data analysis, metabolic marker selection, and related tasks.

Specific embodiment

The specific embodiment of the method is further described below on simulated data in conjunction with the technical solution. The simulated data serve only to illustrate the invention for ease of understanding and do not limit it.

Table 1 shows the simulated data of the invention. x_i denotes the i-th sample, the data comprise 10 samples, m₁ to m₅ denote the 5 metabolites in the data, and NaN denotes a missing value.

Table 1: Simulated data

The data in Table 1 contain 4 missing values: x₁₃, x₅₂, x₈₄, and x₉₃. The procedure is illustrated below with x₁₃.

(1) Using formula (1), calculate the distance d between sample x₁ and each other sample: d(x₁,x₂)=1.94, d(x₁,x₃)=1.73, d(x₁,x₄)=3.39, d(x₁,x₅)=3.46, d(x₁,x₆)=4.12, d(x₁,x₇)=2.29, d(x₁,x₈)=2.71, d(x₁,x₉)=2.74, d(x₁,x₁₀)=3.16. With k=6, the set of the 6 samples most similar to x₁ is S_k={x₃,x₂,x₇,x₈,x₉,x₁₀}.

(2) Determine the missingness type of metabolite m₃. Calculate the Pearson correlation coefficients of m₃ with m₁, m₂, m₄, and m₅. The calculation shows that m₄ is most strongly correlated with m₃ and that the correlation is positive, so m₄ is chosen as the reference metabolite of m₃. The set of samples in which m₃ is missing is S_miss={x₁,x₉}, and the set of samples in which m₃ is not missing is S_obs={x₂,x₃,x₄,x₅,x₆,x₇,x₈,x₁₀}. The average of the reference metabolite m₄ over S_miss is μ_miss=7, and its average over S_obs is μ_obs=4.86. Since μ_miss≥μ_obs, the missingness type is missing at random, and the procedure follows step (4) of the method.

(3) Missing-at-random handling (step (4) of the method). Among the 6 nearest neighbor samples of x₁, metabolite m₃ is missing in sample x₉, so x₉₃ is temporarily filled with 6, the average of m₃ over x₃, x₂, x₇, x₈, and x₁₀.

(4) Stable neighbor selection (step (5) of the method). The m₃ values of the samples in S_k are {3,9,5,7,6,6}, with mean μ=6 and standard deviation σ=2 (the sample standard deviation, with divisor k-1=5), so the stable interval is [4,8]. The values x₃₃ and x₂₃ lie outside the stable interval, so samples x₃ and x₂ are removed from S_k, giving S'_k={x₇,x₈,x₉,x₁₀}.

(5) Weighted fill (step (6) of the method). Calculate the weights of the samples in S'_k. Using formula (2), the weights of the samples in S'_k are: w(x₁,x₇)=0.29, w(x₁,x₈)=0.25, w(x₁,x₉)=0.25, w(x₁,x₁₀)=0.21. Using formula (3), the weighted average is x₁₃=w(x₁,x₇)*x₇₃+w(x₁,x₈)*x₈₃+w(x₁,x₉)*x₉₃+w(x₁,x₁₀)*x₁₀,₃=5.95. The value 5.95 is therefore taken as the estimated fill value for the missing value x₁₃.
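The numbers of this example can be checked with a short script. The sketch below uses only the distances and m₃ values listed in the steps above (Table 1 itself is not needed), and assumes inverse-distance weights and the sample standard deviation, both of which reproduce the listed results. Samples are referred to by their index, so 7 stands for x₇.

```python
from statistics import mean, stdev

# Distances from x1 to the other samples, as listed in step (1) of the example.
dist = {2: 1.94, 3: 1.73, 4: 3.39, 5: 3.46, 6: 4.12,
        7: 2.29, 8: 2.71, 9: 2.74, 10: 3.16}
k = 6
neighbors = sorted(dist, key=dist.get)[:k]

# m3 content of the neighbors; the value for x9 is the temporary fill 6.
m3 = {3: 3.0, 2: 9.0, 7: 5.0, 8: 7.0, 9: 6.0, 10: 6.0}
values = [m3[j] for j in neighbors]
mu, sigma = mean(values), stdev(values)

# Stable neighbors: m3 content inside [mu - sigma, mu + sigma].
stable = [j for j in neighbors if mu - sigma <= m3[j] <= mu + sigma]

# Inverse-distance weights (assumed form) and the weighted fill value.
inverse = {j: 1.0 / dist[j] for j in stable}
total = sum(inverse.values())
weights = {j: inverse[j] / total for j in stable}
estimate = sum(weights[j] * m3[j] for j in stable)

print(neighbors)           # [3, 2, 7, 8, 9, 10]
print(stable)              # [7, 8, 9, 10]
print(round(estimate, 2))  # 5.95
```

The rounded weights come out as 0.29, 0.25, 0.25, and 0.21, matching the values listed in the example.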

The missing values x₅₂, x₈₄, and x₉₃ are each filled in the same way by steps (1)-(6) of the method.

Claims

1. A method for imputing missing values in metabolomics data based on neighbor stability, characterized by the following steps:

The metabolites in biological samples are detected by mass spectrometry and their spectral data are obtained; the spectral data are analyzed with the preprocessing operations of peak identification, peak matching, and normalization, and the metabolite contents of the samples are determined, yielding the metabolomics data;

Let n denote the number of samples in the metabolomics data and p the number of metabolites per sample, and let x_i = (x_i1, x_i2, …, x_ip) denote the vector of the contents of the p metabolites in the i-th sample, 1 ≤ i ≤ n; when the content of metabolite m in sample x_i is missing, i.e. x_im is a missing value, 1 ≤ m ≤ p, the missing value x_im is filled by the following steps:

(1) Calculate the Euclidean distance d(x_i, x_j) between sample x_i and each other sample x_j, 1 ≤ i ≠ j ≤ n, as follows:

$$d(x_i, x_j) = \sqrt{\frac{\sum_{l=1}^{p} o_{il}\, o_{jl}\, (x_{il} - x_{jl})^2}{\sum_{l=1}^{p} o_{il}\, o_{jl}}} \qquad (1)$$

where o_il indicates whether the content of the l-th metabolite of sample x_i is missing, 1 ≤ l ≤ p: o_il = 0 when the content is missing, and o_il = 1 otherwise; the sum Σ_{l=1}^{p} o_il o_jl is the number of metabolites whose content is missing in neither sample x_i nor sample x_j; the smaller the distance d(x_i, x_j), the higher the similarity between x_i and x_j; the k samples most similar to x_i, as determined by the Euclidean distance, form the sample set S_k;

(2) Determine the missingness type of the metabolite

Calculate the Pearson correlation coefficients between metabolite m and the other metabolites; the metabolite most strongly correlated with m, denoted aux_m, is taken as the reference metabolite of m; the missingness type of metabolite m is judged from the content distribution of the reference metabolite aux_m as follows:

Let S_miss = {x_j | x_jm is a missing value, 1 ≤ j ≤ n} denote the set of samples in the metabolomics data in which metabolite m is missing; let S_obs = {x_j | x_jm is not a missing value, 1 ≤ j ≤ n} denote the set of samples in which metabolite m is not missing; calculate the average content of the reference metabolite aux_m over S_miss and over S_obs, denoted μ_miss and μ_obs respectively; when metabolite m and aux_m are positively correlated and μ_miss < μ_obs, the missingness type of m is missing not at random, and the method proceeds to step (3); otherwise, the missingness type of m is missing at random, and the method proceeds to step (4); when metabolite m and aux_m are negatively correlated and μ_miss > μ_obs, the missingness type of m is missing not at random, and the method proceeds to step (3); otherwise, the missingness type of m is missing at random, and the method proceeds to step (4);

(3) Handling of missing-not-at-random values

When the content of metabolite m is missing in some samples of S_k, those missing contents are temporarily filled with the minimum content of metabolite m over all samples in the metabolomics data;

(4) Handling of missing-at-random values

When the content of metabolite m is missing in some samples of S_k, those missing contents are temporarily filled with the average content of metabolite m over the samples of S_k in which m is not missing; when the content of metabolite m is missing in all samples of S_k, the missing contents are temporarily filled with the minimum content of metabolite m over all remaining samples in the metabolomics data;

(5) Determine the stable neighbor samples

The stable neighbor samples in S_k are determined from the degree of fluctuation of the content of metabolite m across the samples of S_k; calculate the mean μ and standard deviation σ of the content of metabolite m over the samples of S_k; any sample whose content of metabolite m lies outside the range [μ-σ, μ+σ] is removed from S_k, which finally yields the stable neighbor sample set S'_k;

(6) Calculate the weighted average of the content of metabolite m over the samples of S'_k; fill the missing content of metabolite m in sample x_i with the value x_im calculated by formula (3); the formulas are as follows:

$$w(x_i, s_j) = \frac{1 / d(x_i, s_j)}{\sum_{l=1}^{k'} 1 / d(x_i, s_l)} \qquad (2)$$

$$x_{im} = \sum_{l=1}^{k'} w(x_i, s_l)\, s_{lm} \qquad (3)$$

where k' = |S'_k| is the number of samples in the set S'_k; s_j and s_l (1 ≤ j, l ≤ k') are samples in S'_k; w(x_i, s_j) is the weight given to the content of metabolite m of sample s_j when calculating x_im; d(x_i, s_j) is the Euclidean distance between sample x_i and s_j calculated by formula (1); s_lm is the content of metabolite m in sample s_l; the content of m in each neighbor sample is given a different weight according to the distance between that neighbor and sample x_i; the smaller the distance between a sample in S'_k and sample x_i, the larger the weight of its content of metabolite m, and the larger its share in the calculation of x_im.