JP2004013573A

JP2004013573A - Processing method for gene expression data, and processing program

Info

Publication number: JP2004013573A
Application number: JP2002166946A
Authority: JP
Inventors: Tomokazu Konishi; 小西　智一
Original assignee: Todai TLO Ltd; Center for Advanced Science and Technology Incubation Ltd
Current assignee: Todai TLO Ltd
Priority date: 2002-06-07
Filing date: 2002-06-07
Publication date: 2004-01-15
Anticipated expiration: 2022-06-07
Also published as: JP4266575B2

Abstract

<P>PROBLEM TO BE SOLVED: To more precisely analyze gene expression data obtained from a DNA chip and the like. <P>SOLUTION: A sorting/sampling processing part 46 sorts data values of obtained array data, and samples the sorted data values at given intervals to give a given number of data values. A background candidate computing part 32 selects a plurality of background candidates. The value of each background candidate is subtracted from each sample data value, and each obtained subtraction value is transformed logarithmically. A difference computing/comparing processing part 38 computes a standard value of normal distribution corresponding to each logarithmic value, and about each background candidate, computes an index indicating the difference between each logarithmic value and the standard value. The range of background candidate values is narrowed in dependence on the indices, before the subtraction value and logarithmic value production, the difference indicating index computation and the background candidate value narrowing are repeated to determine a background value. <P>COPYRIGHT: (C)2004,JPO

Description

【０００１】
【産業上の技術分野】
本発明は、遺伝子発現データを統計的に解析する手法に関する。
【０００２】
【従来の技術】
遺伝子発現データを取得するために、ＤＮＡチップを利用することが知られている。ＤＮＡチップとは、スライドガラスなどの基材上に複数の遺伝子を異なるスポットとして固定させたものである。たとえば、マイクロアレイには、数千から数万の遺伝子がターゲットとして固定されている。ターゲットとして、一重鎖のＤＮＡやｍＲＮＡなどが利用される。
【０００３】
ＤＮＡチップの基材として、種々のコーティングを施したガラスなどからなる板、ナイロンやニトロセルロースからなる膜、中空糸、半導体材料、金属材料、有機物質など核酸を保持できる種々のものが利用できる。また、ターゲットとして、ｃＤＮＡの全部或いはその一部を複製したもの、ゲノムＤＮＡの一部を複製したもの、合成ＤＮＡおよび／または合成ＲＮＡが利用され得る。基材にターゲットを固定するために、フォトリソグラフ法によりガラス板上にオリゴＤＮＡを合成する手法と、スポッタ等を利用して基材にターゲットを取り付ける手法とが知られている。
【０００４】
このようなＤＮＡチップに、たとえば、蛍光標識をつけたＤＮＡやＲＮＡ（解析対象）をハイブリタイズさせる。ターゲットと相補的な解析対象が二重鎖を形成する。解析対象には蛍光標識が付されているため、ハイブリダイゼーションの後に、蛍光スキャナにてＤＮＡチップを操作した画像データを取得することができる。このようにして取得された画像データに基づき、何れかのスポットに二重鎖が形成されているかを知ることが可能となる。より具体的には、得られた画像は、ハイブリダイゼーションの結果、各々のＤＮＡに由来するスポットが表示される。したがって、スポットの位置を含む所定の領域のシグナル強度を積算することにより、各スポットのシグナル強度を示す値からなるアレイデータを得ることができる。
【０００５】
たとえば、数千から数万のターゲットが固定されているマイクロアレイにより、多数の遺伝子発現を示すアレイデータを一度の実験操作で得ることができる。この結果、ある一つの遺伝子発現のデータの増減を測定する際に、その対象として多数の遺伝子発現を示すデータ（シグナル強度を示す値）の平均を算出し、これに基づいてデータを標準化するのが一般的である。より具体的には、実験ごとの発現データを比較する前にデータを標準化する。たとえば、Ｊｏｈｈａｎｅｓ　Ｓｃｈｕｃｈｈａｒｄｔらによる「Ｎｏｒｍａｌｉｚａｔｉｏｎ　ｓｔｒａｔｅｇｉｅｓ　ｆｏｒ　ｃＤＮＡ　ｍｉｃｒｏａｒｒａｙｓ（Ｎｕｃｌｅｉｃ　Ａｃｉｄｓ　Ｒｅｓｅａｒｃｈ
（２０００）　Ｖｏｌ．２８　Ｎｏ．１０）」には、その標準化の一例が開示されている。
【０００６】
【発明が解決しようとする課題】
取得されたデータの確率分布はノンパラメトリックである。しかしながら、たとえば、Ｔｏｄｄ　Ｒｉｃｈｍｏｎｄらによる「Ｃｈａｓｉｎｇ　ｔｈｅ　ｄｒｅａｍ：　ｐｌａｎｔ　ＥＳＴ　ｍｉｃｒｏａｒｒａｙｓ　（Ｃｕｒｒｅｎｔ　Ｏｐｉｎｉｏｎ　ｉｎ　Ｐｌａｎｔ
Ｂｉｏｌｏｇｙ　（２０００）　Ｖｏｌ．３　ｐｐ１０８−１１６）」に開示されているように、取得されたデータを標準化するために、Ｚ−標準やｔ−標準、或いは、各スポットのシグナル強度の積算値を全体の数値の算術平均で除するというような手法が用いられている。
【０００７】
これらはノンパラメトリックな手法ではないため、このような標準化がデータの精度を著しく損ねているという問題点があった。
また、蛍光スキャナにより取得された画像に基づくアレイデータは、必ず、バックグラウンド成分を含む。これは、画像データ全体に存在するバックグラウンドのシグナル強度、および、測定範囲と実際のスポットの大きさや形状が必ずしも一致しないことに起因する。したがって、取得した画像データの数値からバックグラウンド成分を差し引き、真のシグナル値からなるデータを取得することが正確な解析のために重要となる。他の手法、たとえば、電気信号の検出、放射線の検出により取得されたアレイデータでも同様である。
【０００８】
従来、バックグラウンド成分を、特定のスポットやスポットされない部分のシグナル強度をあらわす数値に基づき、画素あたりの平均値や中央値を求め、この値に測定領域の画素数を乗ずることにより推定していた。
或いは、Ｍｉｃｈａｅｌ　Ｅｉｓｅｎが、「ＳｃａｎＡｌｙｚｅ　Ｕｓｅｒ　Ｍａｎｕａｌ（ｈｔｔｐ：／／ｒａｎａ．ｌｂｌ．ｇｏｖ／ＥｉｓｅｎＳｏｆｔｗａｒｅ．ｈｔｍ）」において提案しているように、スポットごとに、測定範囲の外側近傍の値からバックグラウンド成分を推定する手法も知られている。
しかしながら、上記従来の補正法においては、バックグラウンド値算出のために利用されるスポットや画像中の領域の相違により、上記バックグラウンドの推定値は変化する。つまり、上記相違から種々のバックグラウンド値が推定される可能性があり、何れが適切であるかを判断することができないという問題点があった。特に、ＤＮＡをスポットした領域と、そうでない領域との間で、バックグラウンド値の差が大きくなることがあった。
【０００９】
そこで、本発明者は、ＤＮＡチップから得られるデータ（遺伝子発現による発光量を示すデータ）の対数値が３パラメータ正規分布することを知見し、上記データを対数変換し、さらに標準化（たとえば、ｚ−標準化）することを提案した。上記手法により、異なる実験の結果や同種の実験結果を正確に比較することが可能となった。
本発明は、さらに、ＤＮＡチップなどから得られる遺伝子発現データに基づき、より精度の良い解析を施すことが可能なデータ処理方法を提供することを目的とする。
【００１０】
【課題を解決するための手段】
本発明の目的は、遺伝子の発現量に基づき得られたアレイデータ、たとえば、ＤＮＡチップやタンパクチップのハイブリダイゼーションなどにより、チップ上に配置された各スポットのシグナル強度を示す値から構成されるアレイデータを処理して、解析可能なデータを取得する遺伝子発現データの処理方法であって、前記アレイデータを取得して、取得されたアレイデータのデータ値をソートし、前記ソートされたデータ値から、所定間隔で所定数のデータ値を抽出し、これを一時的に記憶手段に記憶するステップと、複数のバックグラウンド候補を選択して、これを一時的に記憶手段に記憶するステップと、前記抽出されたデータ値のそれぞれから、各バックグラウンド候補の値を減じて、減算値を取得し、かつ、各減算値を対数変換した対数値を得て、当該対数値を一時的に記憶手段に記憶するステップと、前記対数値のそれぞれに対応する、正規分布の標準値を算出するステップと、前記各バックグラウンド候補について、各対数値と標準値との間の差異を示す指標を算出するステップと、前記指標に基づき、前記バックグラウンド候補の値の範囲を絞り込むステップと、前記減算値および対数値の取得、差異を示す指標の算出、バックグラウンド候補の値の絞込みを繰り返すことにより、バックグラウンド値を決定するステップと、前記決定されたバックグラウンド値に関連して一時的に記憶された対数値を、それぞれ標準化し、標準化された値を、それぞれ、記憶手段に記憶するステップとを備えたことを特徴とする遺伝子発現データの処理方法により達成される。
【００１１】
本発明によれば、ソートされた値の対数値と、対応する標準値との差異に基づいて、その差異が最小となるようなバックグラウンド値が定められるため、より適切なバックグランド値を決定することができ、その結果、他のデータとの比較を含む解析の対象となるデータをより適切なものとすることが可能となる。
【００１２】
なお、上記差異の指標として、差異の絶対値の総和、差異の二乗（二乗誤差）の総和、最小二乗法の「ｒ」などを利用することができる。前記ソートされたデータ値から、所定間隔で所定数のデータ値を抽出する際の所定間隔は、間隔が「０」であること、つまり、全てのデータを抽出することも含む。また、抽出されたｎ個のデータのうちの、第ｉ番目のデータ値に対応する標準値は、正規分布の第ｉ番目のｎ分位数とすれば良い。
【００１３】
また、本発明の目的は、遺伝子の発現量に基づき得られたアレイデータを処理して、解析可能なデータを取得する遺伝子発現データの処理方法であって、前記アレイデータを取得して、取得されたアレイデータのデータ値をソートし、前記ソートされたデータ値から、所定間隔で所定数のデータ値を抽出し、これを一時的に記憶手段に記憶するステップと、バックグラウンド値γを決定して、これを記憶手段に記憶するステップと、前記バックグランド値を減じたデータ値である減算値を対数化して、対数値を取得し、これを記憶手段に一時的に記憶するステップと、前記対数値を参照して、中心的傾向の特性値μおよび変動の特性値σを算出し、これらを記憶手段に記憶するステップと、各データ値ｘについて、標準値ｚとして、ｚ＝（ｌｏｇ（ｘ−γ）−μ）／σを算出して、算出された標準値ｚを、それぞれ記憶手段に記憶するステップとを備えたことを特徴とする遺伝子発現データの処理方法によっても達成される。
【００１４】
本発明によれば、算出されたパラメータγ、μおよびσを用いて、アレイデータのデータ値ｘを、それぞれ、ｚ＝（ｌｏｇ（ｘ−γ）−μ）／σと標準化し、より解析に適したものを得ることが可能となる。
好ましい実施対応においては、前記バックグラウンド値γを決定するステップが、複数のバックグラウンド候補を選択して、これを一時的に記憶手段に記憶するステップと、前記抽出されたデータ値のそれぞれから、各バックグラウンド候補の値を減じて、減算値を取得し、かつ、各減算値を対数変換した対数値を得て、当該対数値を一時的に記憶手段に記憶するステップと、前記対数値のそれぞれに対応する、正規分布の標準値を算出するステップと、前記各バックグラウンド候補について、各対数値と標準値との間の差異を示す指標を算出するステップと、前記指標に基づき、前記バックグラウンド候補の値の範囲を絞り込むステップとを有し、前記減算値および対数値の取得、差異を示す指標の算出、バックグラウンド候補の値の絞込みを繰り返すことにより、バックグラウンド値を決定するように構成されている。
【００１５】
より好ましい実施態様においては、前記中心的傾向の特性値μおよび変動の特性値σを求めるステップが、前記対数値のそれぞれに対応する標準値を算出するステップと、前記対数値と標準値とを比較し、両者の比がほぼ一定に推移する範囲を求めるステップと、前記標準値をｘ軸、対数値をｙ軸と考えた場合に、前記範囲において形成される直線の傾きおよびｙ切片を算出するステップと、算出されたｙ切片を中心的傾向の特性値μと決定し、傾きを変動の特性値σと決定するステップとを有する。ここでは、いわゆる正規確率プロット（Ｎｏｒｍａｌ　Ｐｒｏｂａｂｉｌｉｔｙ　ｐｌｏｔ：ＮＰＰ）を利用して、直線性が担保された領域を見出し、当該領域から導き出される直線の傾きおよび切片を、それぞれ、σおよびμと決定する。これにより、よりロバストな標準化を実現することが可能となる。
【００１６】
別の好ましい実施態様においては、さらに、前記データ値を、前記チップ上に配置されたスポットの順に並べ替え、その順序で記憶手段に一時的に記憶するステップと、前記チップにおいてスポットが配置された列或いは行に関して、当該列或いは行ごとのデータ値の傾向を示す指標を算出するステップと、前記指標に基づき、列或いは行ごとに特徴がある場合に、各列或いは各行について、それぞれ、そのデータ値の中央値を算出するステップと、前記データ値を、対応する中央値で除して、除算値を取得して、これを記憶手段に一時的に記憶するステップとを備え、前記一時的に記憶された除算値を、アレイデータのデータ値に対応する値として、演算対象とする。
【００１７】
この実施態様によれば、アレイチップの精度に問題がある場合、特に、打刻機の精度の問題や、チップ自体のスポットに配置されるクローンの出自などにより、列や行が特異となっている場合であっても、その特異性を解消し、ロバストな標準化を施し得る状態にすることができる。
前記傾向を示す指標を算出するステップが、特定の列或いは行に関する移動平均を算出するステップを含んでいても良い。
【００１８】
また、別の好ましい実施態様においては、さらに、前記データ値を、前記チップ上に配置されたスポットの順に並べ替え、その順序で記憶手段に一時的に記憶するステップと、前記順序で、データ値の周期性を見出すステップと、前記周期性のある場合に、各データ値から、当該周期の中心的傾向の特性値を減じて減算値を算出し、これを記憶手段に一時的に記憶するステップとを備え、前記一時的に記憶された減算値を、アレイデータのデータ値に対応する値として、演算対象とする。ここでは、アレイデータの値が、一定の周期性を持つ場合に、周期性をもつ要素を排除しておくことで、解析対象としてより適切なデータを得ることができる。
【００１９】
また、別の実施態様においては、さらに、前記データ値を、前記チップ上に配置されたスポットの順に並べ替えるステップと、前記チップにおいてスポットが配置された列或いは行に関して、当該列または行ごとに、データ値の中心的傾向の特性値を算出するステップと、前記中心的傾向の特性値に基づき、当該列或いは行に属するスポットに関するバックグラウンド値を設定し、当該スポットに関するデータ値のそれぞれから、バックグラウンド値を減じて減算値を算出するステップと、前記減算値を、それぞれ対数化して、対数値を取得するステップと、前記列或いは行に関して、前記対数値の中心的傾向の特性値を減算し、前記減算値を、記憶手段に一時的に記憶するステップとを備え、前記一時的に記憶された減算値を、アレイデータのデータ値に対応する値として、演算対象とする。
【００２０】
さらに、本発明の目的は、遺伝子の発現量に基づき得られたアレイデータを処理して、解析可能なデータを取得する遺伝子発現データの処理方法であって、前記チップにおいてスポットが配置された列或いは行に関して、当該列または行ごとに、データ値の中心的傾向の特性値を算出するステップと、前記中心的傾向の特性値に基づき、当該列或いは行に属するスポットに関するバックグラウンド値の候補を設定し、当該スポットに関するデータ値のそれぞれから、バックグラウンド候補値を減じて減算値を算出するステップと、前記減算値を、それぞれ対数化して、対数値を取得するステップと、前記列或いは行に関して、前記対数値の中心的傾向の特性値を算出し、前記対数値のそれぞれから減じて第２の減算値を算出するステップと、前記列或いは行に関して、前記データ値を、前記第２の減算値に基づき算出される変動の特性値で除して、除算値を取得し、これを記憶手段に一時的に記憶するステップと、前記除算値と、対応する標準値とを比較し、これらの間の差異の指標が最も小さくなるような、バックグラウンド候補値をバックグラウンド値γと決定するステップと、前記バックグラウンド値γ、当該バックグラウンド値γと関連する中心的傾向の特性値μおよび変動の特性値σを、それぞれ記憶手段に記憶するステップとを備えたことを特徴とする遺伝子発現データの処理方法によっても達成される。
【００２１】
本発明によれば、列或いは行ごとの中心的傾向の特性値に基づいてバックグランド値が決定される。たとえば、列ごとのバックグラウンド値は、当該列の中心的傾向の特性値の、ある比例定数倍と考えることができる。これにより、列や行の特異性を排除することが可能となる。
【００２２】
また、本発明の目的は、遺伝子の発現量に基づき得られたアレイデータを処理して、解析可能なデータを取得する遺伝子発現データの処理方法であって、前記アレイデータを取得して、取得されたアレイデータのデータ値をソートし、ソートされたデータを、記憶手段に一時的に記憶するステップと、前記ソートされたデータ値に対応する、正規分布の標準値を算出するステップと、前記データ値に関して、その変動の特性値ｓを設定して、これを記憶手段に記憶するとともに、前記標準値のそれぞれに乗じて、乗算値を得るステップと、前記データ値と乗算値とを比較し、両者の比が一定に推移する範囲を求めるステップと、前記乗算値をｘ軸、対数値をｙ軸と考えた場合に、前記範囲において形成される直線の傾きおよびｙ切片を算出するステップと、前記傾きの自然対数を中心的傾向の特性値ｕ、切片をバックグラウンド値ｇと決定して、これらを記憶手段に記憶するステップとを備えたことを特徴とする遺伝子発現データの処理方法によっても達成される。
【００２３】
たとえば、ウェット実験の不良などが原因で、ハイブリダイゼーション全体のノイズレベルが高くなり、そのレベルが無視できない場合に、チップとサンプルのデータの組み合わせから、ノイズがなければ対数正規分布となることが期待できる場合には、上記手法を利用した標準化を適用することができる。
ここでは、さらに、ｘｉ＝（１０^ｕ）＊（１０^{（ｓ＊Ｚｉ）}）＋ｇ
（ただし、Ｚｉは、第ｉ番目の標準値）を用いて、ｘｉを解き、これを、記憶手段に一時的に記憶するステップと、前記ｘｉとして利用することができる値の下限値を求め、これを前記記憶手段に記憶するステップとを備えているのが望ましい。これにより、解析対象として利用できるデータの範囲を知ることができる。
【００２４】
また、本発明の目的は、遺伝子の発現量に基づき得られたアレイデータを処理して、解析可能なデータを取得するようにコンピュータを動作させる、コンピュータにより読み取り可能なプログラムであって、前記アレイデータを取得して、取得されたアレイデータのデータ値をソートし、前記ソートされたデータ値から、所定間隔で所定数のデータ値を抽出し、これを一時的に記憶手段に記憶するステップと、複数のバックグラウンド候補を選択して、これを一時的に記憶手段に記憶するステップと、前記抽出されたデータ値のそれぞれから、各バックグラウンド候補の値を減じて、減算値を取得し、かつ、各減算値を対数変換した対数値を得て、当該対数値を一時的に記憶手段に記憶するステップと、前記対数値のそれぞれに対応する、正規分布の標準値を算出するステップと、前記各バックグラウンド候補について、各対数値と標準値との間の差異を示す指標を算出するステップと、前記指標に基づき、前記バックグラウンド候補の値の範囲を絞り込むステップと、前記減算値および対数値の取得、差異を示す指標の算出、バックグラウンド候補の値の絞込みを繰り返すことにより、バックグラウンド値を決定するステップと、前記決定されたバックグラウンド値に関連して一時的に記憶された対数値を、それぞれ標準化し、標準化された値を、それぞれ、記憶手段に記憶するステップとを、前記コンピュータに実行させることを特徴とするプログラムにより達成される。
【００２５】
さらに、本発明の目的は、遺伝子の発現量に基づき得られたアレイデータを処理して、解析可能なデータを取得するようにコンピュータを動作させる、コンピュータにより読み取り可能なプログラムであって、前記アレイデータを取得して、取得されたアレイデータのデータ値をソートし、前記ソートされたデータ値から、所定間隔で所定数のデータ値を抽出し、これを一時的に記憶手段に記憶するステップと、バックグラウンド値γを決定して、これを記憶手段に記憶するステップと、前記バックグランド値を減じたデータ値である減算値を対数化して、対数値を取得し、これを記憶手段に一時的に記憶するステップと、前記対数値を参照して、中心的傾向の特性値μおよび変動の特性値σを算出し、これらを記憶手段に記憶するステップと、各データ値ｘについて、標準値ｚとして、ｚ＝（ｌｏｇ（ｘ−γ）−μ）／σを算出して、算出された標準値ｚを、それぞれ記憶手段に記憶するステップとを、前記コンピュータに実行させることを特徴とするプログラムによっても達成される。
【００２６】
或いは、本発明の目的は、遺伝子の発現量に基づき得られたアレイデータを処理して、解析可能なデータを取得するようにコンピュータを動作させる、コンピュータにより読み取り可能なプログラムであって、前記チップにおいてスポットが配置された列或いは行に関して、当該列または行ごとに、データ値の中心的傾向の特性値を算出するステップと、前記中心的傾向の特性値に基づき、当該列或いは行に属するスポットに関するバックグラウンド値の候補を設定し、当該スポットに関するデータ値のそれぞれから、バックグラウンド候補値を減じて減算値を算出するステップと、前記減算値を、それぞれ対数化して、対数値を取得するステップと、前記列或いは行に関して、前記対数値の中心的傾向の特性値を算出し、前記対数値のそれぞれから減じて第２の減算値を算出するステップと、前記列或いは行に関して、前記データ値を、前記第２の減算値に基づき算出される変動の特性値で除して、除算値を取得し、これを記憶手段に一時的に記憶するステップと、前記除算値と、対応する標準値とを比較し、これらの間の差異の指標が最も小さくなるような、バックグラウンド候補値をバックグラウンド値γと決定するステップと、前記バックグラウンド値γ、当該バックグラウンド値γと関連する中心的傾向の特性値μおよび変動の特性値σを、それぞれ記憶手段に記憶するステップとを、前記コンピュータに実行させることを特徴とするプログラムによっても達成される。
【００２７】
また、本発明の目的は、遺伝子の発現量に基づき得られたアレイデータを処理して、解析可能なデータを取得するようにコンピュータを動作させる、コンピュータにより読み取り可能なプログラムであって、前記アレイデータを取得して、取得されたアレイデータのデータ値をソートし、ソートされたデータを、記憶手段に一時的に記憶するステップと、前記ソートされたデータ値に対応する、正規分布の標準値を算出するステップと、前記データ値に関して、その変動の特性値ｓを設定して、これを記憶手段に記憶するとともに、前記標準値のそれぞれに乗じて、乗算値を得るステップと、前記データ値と乗算値とを比較し、両者の比が一定に推移する範囲を求めるステップと、前記乗算値をｘ軸、対数値をｙ軸と考えた場合に、前記範囲において形成される直線の傾きおよびｙ切片を算出するステップと、前記傾きの自然対数を中心的傾向の特性値ｕ、切片をバックグラウンド値ｇと決定して、これらを記憶手段に記憶するステップとを、前記コンピュータに実行させることを特徴とするプログラムによっても達成される。
【００２８】
ＤＮＡチップの基材として、種々のコーティングを施したガラスなどから作られた板、ナイロンやニトロセルロースなどを基材とする膜、中空糸、半導体、金属、有機物質など、表面に核酸を保持できる任意のものを利用できる。また、ＤＮＡチップ上には、ターゲットとして、ｃＤＮＡの全部或いは一部の複製、ゲノムＤＮＡの複製、合成ＤＮＡ、合成ＲＮＡなどが配置される。
【００２９】
また、チップを作製するには、核酸を用意しておき、これを、吸着、静電気による結合、共有結合により基材上に配置する手法や、基材上で核酸を合成する手法がある。シグナル強度を示す信号の検出には、半導体チップを利用した電気的な手法、蛍光や放射能を検出する手法などが含まれる。
【００３０】
本発明は、上記何れの基材の上に何れのターゲットが形成されたＤＮＡチップからのアレイデータにも適用することができる。また、何れの手法を用いて取得したアレイデータに対しても適用することができる。また、固定化されたＤＮＡなどの遺伝子を固定化したマイクロビーズなど、他の媒体から得られたデータについても同様である。
【００３１】
なお、本明細書において、ＤＮＡチップとは、基材上にＲＮＡを形成したＲＮＡチップ、マイクロアレイ、マクロアレイ、ドットブロット、リバースト・ノーザンなど、基材の上に核酸が配置された任意のものを含む。
【００３２】
【発明の実施の形態】
以下、添付図面を参照して、本発明の実施の形態につき説明を加える。図１は、本発明の第１の実施の形態にかかる解析装置のハードウェア構成図である。図１に示すように、解析装置１０は、ＣＰＵ１２と、マウスやキーボードなどの入力装置１４と、ＣＲＴなどから構成される表示装置１６と、ＲＡＭ（Ｒａｎｄｏｍ　Ａｃｃｅｓｓ　Ｍｅｍｏｒｙ）１８と、ＲＯＭ（Ｒｅａｄ　Ｏｎｌｙ　Ｍｅｍｏｒｙ）２０と、ＣＤ−ＲＯＭやＤＶＤ−ＲＯＭなどの可搬記憶媒体２３をアクセスする可搬記憶媒体ドライバ２２と、ハードディスク装置２４と、外部とのデータ授受を制御するインタフェース（Ｉ／Ｆ）２６とを備えている。図１から理解できるように、本実施の形態にかかる解析装置１０として、パーソナルコンピュータなどを利用することができる。
【００３３】
Ｉ／Ｆ２６は、ハイブリタイズされたＤＮＡチップ上のスポットの発光量を計測して、計測された発光量に基づくデータを生成するリーダまたはスキャナ（図示せず）や通信回路に接続されている。通信回路は、さらに、外部ネットワーク（たとえば、インターネット）に接続されている。
本実施の形態において、可搬記憶媒体２３には、リーダまたはスキャナからのデータを受け入れて、当該データに対して後述する必要なデータ変換処理を実行するプログラム、および、処理が施されたデータを解析するためのプログラムが記憶されている。したがって、可搬記憶媒体ドライバ２２が、可搬記憶媒体２３から、上記プログラムを読み出して、これをハードディスク装置２４に記憶して、これを起動することにより、パーソナルコンピュータが、解析装置１０として作動することが可能となる。或いは、インターネットなどの外部ネットワークを介して、上記プログラムをダウンロードしても良い。
【００３４】
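FIG. 2 is a functional block diagram of the main part of the analysis apparatus 10 according to the first embodiment. FIG. 2 shows the components that execute the processing for deriving analysis results from gene expression data. As shown in FIG. 2, the analysis apparatus 10 has: a data buffer 30; a background candidate computing part 32 that, based on the data temporarily stored in the data buffer 30 (the original data), calculates candidates for the background value corresponding to the noise component of the luminescence of the spots on the DNA chip; a preprocessing part 34 that applies predetermined preprocessing to the original data and performs arithmetic between the background candidate values and the original data; a transformation/standardization processing part 36 that applies the transformation described later to the resulting data and standardizes the transformed data; a difference computing/comparing processing part 38 that calculates the difference between the standardized values and the ideal values, compares the differences obtained for the individual background candidates, and calculates graph correction values based on the comparison; an image forming processing part 40 that forms the images presented to the user; and a result storage part 42 that stores the various data obtained.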
【００３５】
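The preprocessing part 34 has a data correction part 44, which applies processing to increase the randomness of the original data when the data show regularity tied to the column or position (region) on the DNA chip, and a sorting/sampling processing part 46, which sorts the data (corrected by the data correction part 44 where necessary) and extracts predetermined values from the sorted data group.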
【００３６】
データバッファ３０は、ＲＡＭ１８、場合によってはハードディスク装置２４によりその機能が実現される。データバッファには、リーダまたはスキャナから伝達された、各スポットの発光量を示すデータ、或いは、リーダまたはスキャナから伝達されハードディスク装置２４の所定の領域に予め記憶されていた、各スポットの発光量を示すデータが一時的に記憶される。また、データバッファ３０は、バックグランド候補算出部３２にて算出されたバックグラウンド候補値や、前処理部３４にて処理が施されたデータ、場合によっては、対数変換されたデータや演算に利用する標準値ないし理想値などを一時的に記憶することもできる。
【００３７】
リーダまたはスキャナからは、ＤＮＡチップをＣＣＤカメラなどで撮影し、スポットごとのシグナル強度を積算したものが、アレイデータとして出力される。或いは、リーダまたはスキャナにおいて、ＣＣＤカメラにて撮影した画像の画像データの値に基づき、バックグラウンド値が決定され、各画素のシグナル強度からバックグラウンド値が差し引かれ、既にバックグラウンド補正がなされた画像データから、スポットごとのシグナル強度が積算されて、アレイデータとして出力される場合もある。本実施の形態においては、未処理のアレイデータ、上記リーダやスキャナまたは付随するソフトウェアにより補正処理（バックグラウンド補正）が施されたデータの何れをも利用することができる。なお、本明細書において、リーダまたはスキャナから伝達される、上記スポットごとのシグナルを累算したデータを、アレイデータ、或いは、本実施の形態にかかるバックグラウンド処理を施すための基礎となるデータという意味で原データと称する。
【００３８】
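The processing by which the analysis apparatus 10 calculates, from data indicating the luminescence that appeared on a DNA chip, an index that can be compared with other data is described in detail below. FIG. 3 is a flowchart outlining the processing performed by the analysis apparatus 10 according to the present embodiment. As shown in FIG. 3, the analysis apparatus 10 first acquires the original data for a DNA chip from the data buffer 30 (step 301) and preprocesses it (see step 310). In the present embodiment, the preprocessing comprises an optional initial correction performed as required by the state of the original data (step 302), sorting of the acquired original data (step 303), and extraction of the data values at predetermined ranks in the sorted data group (step 304). The initial correction is described in detail later.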
【００３９】
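From the data rearranged in ascending or descending order by the sort, the sorting/sampling processing part 46 in the preprocessing part 34 extracts the data located at ranks a predetermined interval apart. For example, it may extract the values at a fixed interval of ranks, such as the 10th, 20th, 30th, ... largest values. Alternatively, it may extract predetermined quantiles, such as the 1st percentile, the 2nd percentile, and so on. The sorted data and the extracted data are stored in a predetermined area of the data buffer 30.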
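The rank-based extraction in steps 303 and 304 can be illustrated with a minimal sketch. The function name `sort_and_sample` and the sample intensities are hypothetical and not part of the specification; the interval of 10 follows the "10th, 20th, 30th" example in the text.

```python
def sort_and_sample(values, step=10):
    """Sort in descending order and keep every `step`-th value (10th, 20th, ...)."""
    ordered = sorted(values, reverse=True)
    # Index step-1 is the step-th largest value; then advance by `step` ranks.
    return ordered[step - 1::step]


# Hypothetical spot intensities 1..100 (illustration only).
intensities = list(range(1, 101))
print(sort_and_sample(intensities, step=10))
# [91, 81, 71, 61, 51, 41, 31, 21, 11, 1]
```

Setting `step=1` keeps every value, which corresponds to the case where all data are extracted.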
【００４０】
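Next, the background value is calculated (step 305), and then the other parameters are calculated (step 306). The present embodiment builds on two findings: that the logarithms of the data obtained from a DNA chip (data indicating the luminescence due to gene expression) are normally distributed, and that z-standardizing them makes it possible to compare accurately the results of different experiments and of experiments of the same kind. On that basis, a more robustly standardized data group is derived from the data of a given DNA chip.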
【００４１】
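In the present embodiment, in
z = (log(x - γ) - μ) / σ
the calculated background value is taken as γ, and the remaining parameters μ and σ are calculated by the operations described later. The calculation of the background value is explained in detail first, followed by the calculation of the remaining parameters.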
【００４２】
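FIG. 4 is a flowchart showing the background value calculation (step 305) in more detail. The background candidate computing part 32 determines, in accordance with input made by the operator through the input device, the range of candidates for the background value (background candidate values) and a plurality of background candidate values within that range. For example, when the user specifies a start point (for example, 0) and an end point (for example, the median or the first quartile) for the background candidate values, a predetermined number of equally spaced (or geometrically spaced) values between the start and end points is determined. For example, when 0 and the median are specified, eight equally spaced values are taken between them, giving ten background candidate values including the start and end points. During this processing, the background candidate values are stored in the data buffer 30 and are read out and updated as necessary.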
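As a rough sketch of the candidate selection described above, the two helpers below generate equally spaced and geometrically spaced candidates between a start and an end point. The function names and the end point of 900 are assumptions made for illustration; the specification only fixes the idea of ten candidates including both end points.

```python
def linear_candidates(start, end, count=10):
    """`count` equally spaced values from start to end, both included."""
    step = (end - start) / (count - 1)
    return [start + k * step for k in range(count)]


def geometric_candidates(start, end, count=10):
    """`count` geometrically spaced values; requires start > 0."""
    ratio = (end / start) ** (1 / (count - 1))
    return [start * ratio ** k for k in range(count)]


# Hypothetical case: start point 0, end point (e.g. a median) of 900.
print(linear_candidates(0, 900))
# [0.0, 100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0, 900.0]
```

Geometric spacing cannot start at 0, so a small positive start point would have to be chosen in that case.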
【００４３】
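Next, a background candidate value is subtracted from the extracted values of the original data (original data values) (step 402), and the transformation/standardization processing part 36 logarithmically transforms the original data values from which the background candidate value has been subtracted (step 403). The log-transformed data obtained here are also stored in the data buffer 30 for use in later processing. Steps 402 and 403 are executed for all of the selected background candidate values (for example, ten).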
【００４４】
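Next, the log-transformed data values (transformed values) for a given background candidate value are compared with the corresponding standard values, which are calculated as described below and stored in the data buffer 30, and an index representing the difference between the values is calculated (step 404). In the present embodiment, the standard values are obtained as follows.
Because each quantile has a width, the following quantity is calculated to correct the statistical median:
m(i) = (i - 0.3175) / (n + 0.365)
where n is the number of data values and i is a natural number from 1 to n.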
【００４５】
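Next, the inverse F^-1 of the normal distribution function is applied to each m(i) obtained. Each resulting value F^-1(m(i)) is the standard value corresponding to a data value.
Next, for each background candidate value, the difference computing/comparing processing part 38 calculates, for example, the sum of the absolute values of the differences (the differences between the data values and the standard values) or the sum of the squares of the differences. The value obtained is the difference index for that background candidate value. The least-squares "r" may of course be used as the difference index instead. In practice, using the least-squares "r" is preferable from the standpoint of obtaining a highly accurate background value.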
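The standard values can be computed directly from the formula for m(i). The sketch below uses Python's standard-library `statistics.NormalDist` for the inverse of the standard normal distribution function; the function name `standard_values` is an assumption, and the index i is taken to run over the data sorted in ascending order.

```python
from statistics import NormalDist


def standard_values(n):
    """Return F^-1(m(i)) for i = 1..n, with m(i) = (i - 0.3175) / (n + 0.365)."""
    normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    return [normal.inv_cdf((i - 0.3175) / (n + 0.365)) for i in range(1, n + 1)]


z = standard_values(5)
# m(i) + m(n + 1 - i) = 1, so the values are symmetric about zero
# and the middle value of an odd-sized sample is (numerically) zero.
print([round(v, 3) for v in z])
```

Because the values depend only on n, they can be computed once and reused for every background candidate.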
【００４６】
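Next, the difference computing/comparing processing part 38 generates, for example, a graph with the background candidate value on the horizontal axis and the difference index on the vertical axis, and displays it on the screen of the display device 16 (step 405).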
【００４７】
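Referring to the graph displayed on the screen of the display device 16, the operator selects a desirable range of background candidate values, or a background value (step 406). If the selected value is considered fully satisfactory as the background value (Yes in step 407), the processing ends. If it is not, a predetermined number of new background candidate values is determined from the newly selected, narrower range of background candidate values (step 408), and steps 402 to 407 are repeated. The new background candidate values may likewise divide the interval between the start and end points of the range either equally or geometrically. The background value finally obtained is stored in the result storage part 42.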
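The loop of steps 402 to 408 is operator-driven in the embodiment. Purely as an illustration, the sketch below automates the narrowing: it scores each candidate with the correlation coefficient r between the standard values and the log-transformed data, keeps the best candidate, and repeats on the interval around it. The automatic selection rule, the base-10 logarithm, the number of rounds, and all function names are assumptions, not the patented procedure itself.

```python
import math
from statistics import NormalDist


def standard_values(n):
    normal = NormalDist()
    return [normal.inv_cdf((i - 0.3175) / (n + 0.365)) for i in range(1, n + 1)]


def pearson_r(xs, ys):
    """Correlation coefficient r of a least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def estimate_background(values, lo, hi, n_candidates=10, rounds=6):
    """Narrow the candidate range [lo, hi] repeatedly around the best-scoring candidate."""
    data = sorted(values)            # ascending, to pair with ascending standard values
    z = standard_values(len(data))
    hi = min(hi, data[0] - 1e-9)     # keep every x - gamma strictly positive
    best = lo
    for _ in range(rounds):
        step = (hi - lo) / (n_candidates - 1)
        candidates = [lo + k * step for k in range(n_candidates)]
        best = max(
            candidates,
            key=lambda g: pearson_r(z, [math.log10(x - g) for x in data]),
        )
        lo, hi = max(lo, best - step), min(hi, best + step)
    return best


# Synthetic check (assumed parameters): background 200, mu 3.0, sigma 0.5.
z = standard_values(200)
synthetic = [200.0 + 10 ** (3.0 + 0.5 * zi) for zi in z]
print(round(estimate_background(synthetic, 0.0, 240.0), 1))
```

With real data an operator would still want to inspect the plotted index, as the embodiment describes, since noise can make the score curve flat or irregular near its optimum.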
【００４８】
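For example, as shown in FIG. 12, a graph is generated with the range of background candidate values on the horizontal axis and the difference index on the vertical axis. In the example of FIG. 12, values from 1800 to 2700 in steps of 100 (1800, 1900, 2000, ..., 2700) are used as the background candidate values. Referring to this graph, the observer can narrow the range of background candidates and obtain the difference index again for background candidate values in the new range (see FIG. 13). In the example of FIG. 13, a background value of 2363 can be seen to be the most appropriate at this point.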
【００４９】
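Next, the processing for calculating the remaining parameters is described. In general, for a log-normal distribution, the mean of the log-transformed data is used as μ (the characteristic value of central tendency) and the standard deviation as σ (the characteristic value of variation). In data obtained from a DNA chip, however, strong signals (relatively large data values) are accurate, whereas weak signals (relatively small data values) contain relatively large noise. Data that noise has driven negative have no logarithm, so many of these weak signals are discarded. In such cases the above calculation cannot be used.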
【００５０】
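The characteristic value of central tendency is normally given by the mean. The mean, however, is not a robust estimator, and it is overestimated in particular when weak signals drop out selectively. The median is known to be more effective in such cases.
The characteristic value of variation, for its part, is normally expressed by the standard deviation. The standard deviation is not robust either, and it is underestimated when weak signals drop out selectively as described above. As a robust alternative, the iqr, which derives the characteristic value of variation from the interquartile range, is known (see, for example, http://infoshako.sk.tsukuba.ac.jp/InfoRes/jdoc/MATLAB5/jhelp/toolbox/stats/iqr.html).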
【００５１】
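However, the median is derived from a single point of the data group and the iqr from two, so their accuracy is limited. The problem becomes serious especially for data acquired from a small number of spots or when the amount of data available for correction is limited. The present embodiment therefore adopts the following parameter calculation method, which remains accurate even when the amount of data is relatively limited.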
【００５２】
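FIG. 5 is a flowchart showing the parameter calculation according to the present embodiment. As shown in FIG. 5, the ideal values and the measured values from which the background value has been subtracted are acquired first (step 501). The ideal values are the same as the standard values obtained in step 404 above. Next, a graph with the ideal (theoretical) values on the horizontal axis and the data values based on the measured values on the vertical axis is created and displayed on the screen of the display device (step 502). If the measured values were exactly log-normally distributed, this graph would almost coincide with y = x. In practice, however, as shown in FIG. 14, the graph obtained by plotting the measured values has a slope other than 1 (= a; a ≈ 0.56 in FIG. 14) and a y-intercept (= b; b ≈ 2.80 in FIG. 14), and it loses linearity where x is relatively small.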
【００５３】
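Even in the graph of FIG. 14, however, there is a portion that can be regarded as nearly straight (for example, the portion where x is positive). In the present embodiment, therefore, when the user refers to the graph and operates the input device to specify the range judged to be linear (step 503), a linear expression relating the measured values in the specified range to the theoretical values is obtained, for example by the least-squares method. In the resulting linear expression "ax + b", the slope "a" corresponds to the characteristic value of variation "σ", and the y-intercept "b" corresponds to the characteristic value of central tendency "μ" (step 504).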
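The fit over the user-selected linear range can be sketched as an ordinary least-squares line restricted to points whose standard value is at or above a threshold. The threshold parameter `z_min`, the function names, and the sample points are assumptions; the slope 0.56 and intercept 2.80 are chosen only to echo the values quoted for FIG. 14.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx


def fit_mu_sigma(standard, logs, z_min=0.0):
    """Fit only the range judged linear (standard value >= z_min).

    Returns (mu, sigma) = (intercept, slope).
    """
    selected = [(z, y) for z, y in zip(standard, logs) if z >= z_min]
    sigma, mu = fit_line([p[0] for p in selected], [p[1] for p in selected])
    return mu, sigma


# Hypothetical points: the lower end bends away from the line, the upper end is straight.
standard = [-2.0, -1.0, 0.0, 1.0, 2.0]
logs = [2.50, 2.60, 2.80, 3.36, 3.92]
mu, sigma = fit_mu_sigma(standard, logs, z_min=0.0)
print(round(mu, 2), round(sigma, 2))  # 2.8 0.56
```

Unlike the median and the iqr, this estimate draws on every point in the selected range, which is the motivation the text gives for the approach.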
【００５４】
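For example, the image forming processing part 40 of the analysis apparatus 10 may use the obtained "a" and "b" to generate a graph with the ideal values on the horizontal axis and the standardized values z = (log(x - γ) - μ) / σ on the vertical axis, and display it on the screen of the display device 16. FIG. 15 is an example of a graph in which the values plotted in FIG. 14 are plotted again after subtracting μ and dividing by σ. The user refers to the displayed graph, and if it is not satisfactory (No in step 505), the processing returns to the range specification in the original graph and the processing from step 503 onward is executed again.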
【００５５】
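If, on the other hand, the graph is satisfactory (Yes in step 505), the previously obtained background value is stored as "γ", the intercept as "μ", and the slope as "σ" in the result storage part 42, in association with information identifying the DNA chip. Using the parameters acquired in this way, each data value x obtained from the DNA chip can be standardized with the expression
z = (log(x - γ) - μ) / σ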
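A minimal sketch of the final standardization follows. The base-10 logarithm is an assumption (the first embodiment writes only "log", while the third embodiment's expressions use powers of 10), and returning `None` for values at or below the background is an illustrative choice, not something the specification prescribes. The parameter values merely reuse the figures quoted for FIG. 13 and FIG. 14.

```python
import math


def z_standardize(x, gamma, mu, sigma):
    """Return z = (log10(x - gamma) - mu) / sigma, or None if x <= gamma."""
    if x <= gamma:
        return None  # no logarithm exists; treat as below the measurable range
    return (math.log10(x - gamma) - mu) / sigma


gamma, mu, sigma = 2363.0, 2.80, 0.56
print(z_standardize(gamma + 10 ** 2.80, gamma, mu, sigma))  # approximately 0
print(z_standardize(gamma + 10 ** 3.36, gamma, mu, sigma))  # approximately 1
print(z_standardize(2000.0, gamma, mu, sigma))              # None
```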
【００５６】
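Thus, according to the present embodiment, an appropriate background value is calculated to eliminate the influence of noise, and the characteristic values of central tendency and variation used for standardization are obtained from the straight portion of the graph of plotted measured values. This enables more robust standardization.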
【００５７】
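Next, the initial correction according to the present embodiment (step 302) is described in more detail. In the present embodiment, two kinds of correction can be applied, depending on the characteristics of the data from the DNA chip.
A DNA chip is formed by, for example, printing DNA onto a surface such as glass. Because of limits on the accuracy of the printing machine (arrayer or spotter), data values sometimes come out "stronger" or "weaker" in a regular pattern.
Such tendencies can appear per arrayer pin, per row of the spotted grid, or per grid row or column of the microtiter plate holding the DNA samples.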
【００５８】
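For example, when the strength of the data has a distinct character for each row of the grid, the data could be standardized row by row. In that case, however, the number n of data values making up one data set becomes small (for example, 32). Predicting the background value and calculating the characteristic values of central tendency and variation from so few data values is markedly inaccurate. The standard deviation of the mean of random numbers is known to be proportional to the reciprocal of the square root of n, which shows how difficult it is to predict the characteristic value of central tendency accurately from a small amount of data.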
【００５９】
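In the initial correction, therefore, moving averages are calculated over the rows or columns of the DNA chip, and if each row or column has a character of its own, the values are corrected row by row (first preprocessing: see reference numeral 600). Even otherwise, if the variation of the values from spot to spot is periodic, a data correction that takes the periodicity into account is executed (second preprocessing: see FIG. 7).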
【００６０】
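The following description deals with rows, but the same processing can obviously be executed for columns. First, the data are arranged in the order in which the spotter actually spotted them when producing the DNA chip, and the mean of the data values for a given row of the DNA chip together with a predetermined number of rows before and after it (for example, two rows on each side) is calculated (steps 601, 602). The calculation of the mean is repeated to the last row (see steps 603, 604), after which it is judged whether the per-row means show any distinctive character (step 605). FIG. 16 is a graph showing, for the logarithms of data acquired from a DNA chip, the log value of each spot and its moving average. In the example of FIG. 16, the DNA chip has 32 spots per row. If the original data values are random, means taken over the data values of the predetermined number of neighboring rows should nearly agree. In FIG. 16, no trend can be seen in the solid line showing the log value of each spot, but the means of the log values for the 32 spots of each row vary widely, as the broken line shows. In such a case it is judged that each row of the DNA chip has a character of its own (Yes in step 605), and the first preprocessing is applied to the data values.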
【００６１】
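In step 605, a test may be performed to determine whether the variation in the moving averages is significant.
In the first preprocessing, the median of the data values corresponding to the spots of a row of the DNA chip is obtained (step 607), and each data value corresponding to a spot of that row is divided by the median (step 607). This is executed for each row (see steps 609, 610).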
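The first preprocessing can be sketched in two small helpers: one computes the moving mean over a row and its neighboring rows, the other divides each row by its own median. The data layout (a list of rows, each a list of spot values), the function names, and the toy numbers are assumptions for illustration; judging whether the moving means differ significantly is left to the operator or a separate test, as in the text.

```python
from statistics import median


def row_moving_means(rows, half_window=2):
    """Mean over each row pooled with up to `half_window` rows on each side."""
    means = []
    for i in range(len(rows)):
        lo, hi = max(0, i - half_window), min(len(rows), i + half_window + 1)
        pooled = [v for row in rows[lo:hi] for v in row]
        means.append(sum(pooled) / len(pooled))
    return means


def divide_by_row_median(rows):
    """Divide every value of a row by that row's median."""
    corrected = []
    for row in rows:
        m = median(row)
        corrected.append([v / m for v in row])
    return corrected


# Toy chip with two rows whose overall levels differ by a factor of five.
chip = [[2.0, 4.0, 6.0], [10.0, 20.0, 30.0]]
print(row_moving_means(chip, half_window=0))  # [4.0, 20.0]
print(divide_by_row_median(chip))             # [[0.5, 1.0, 1.5], [0.5, 1.0, 1.5]]
```

After the division both rows sit on the same scale, so they can be pooled for sorting and background estimation instead of being standardized from only a few dozen values each.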
【００６２】
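Next, the second preprocessing is described. Here, a correction is applied that takes into account whether the data values corresponding to the spots oscillate. First, the data values arranged in spot order are acquired (step 701), and an FFT (Fast Fourier Transform) is executed on the data group (step 702). If the FFT reveals a periodic component (signal component), the value of the component corresponding to the phase of each data value, taking the period into account, is subtracted from that data value (steps 703, 704). The operator may have steps 703 and 704 repeated until a satisfactory result is obtained. The data subjected to the first or second preprocessing are stored in the data buffer 30 and then undergo the processing from the data sort onward (see step 303 in FIG. 3).
Thus, the initial correction according to the present embodiment makes it possible to eliminate regularities introduced when the spots were produced.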
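The following sketch illustrates one possible reading of the second preprocessing. It swaps the FFT for a plain discrete Fourier transform written with the standard library (equivalent in result, slower in execution), picks the frequency with the largest amplitude, and subtracts from each value the deviation of its phase mean from the overall mean. The phase-mean subtraction, the rounding of the period to an integer, and the function names are all assumptions; the specification does not spell out how the periodic component is removed.

```python
import cmath
import math


def dominant_period(values):
    """Period (in spots) of the strongest non-constant frequency, or None."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    best_k, best_amp = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n) for t, c in enumerate(centered))
        if abs(coeff) > best_amp:
            best_k, best_amp = k, abs(coeff)
    return round(n / best_k) if best_k else None


def remove_periodic_component(values, period):
    """Subtract, per phase, the offset of that phase's mean from the overall mean."""
    overall = sum(values) / len(values)
    phase_means = []
    for p in range(period):
        phase = values[p::period]
        phase_means.append(sum(phase) / len(phase))
    return [v - (phase_means[t % period] - overall) for t, v in enumerate(values)]


# Hypothetical spot-order data: constant level 10 plus a pattern repeating every 4 spots.
signal = [10 + d for _ in range(8) for d in (2, 0, -2, 0)]
period = dominant_period(signal)
print(period)                                          # 4
print(remove_periodic_component(signal, period)[:4])   # [10.0, 10.0, 10.0, 10.0]
```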
【００６３】
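Next, the second embodiment of the present invention is described. In the second embodiment, appropriate parameters are calculated while periodicity is eliminated. FIGS. 8 and 9 are flowcharts outlining the processing according to the second embodiment. In the second embodiment too, as in the initial correction described with reference to FIG. 6, the data are arranged beforehand in the order in which the spotter actually spotted them when producing the DNA chip. As in the example of FIG. 6, the processing is not limited to rows and can equally be executed for columns.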
【００６４】
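In this processing, the data of a given row are acquired (step 801), and the characteristic value of central tendency of the row is calculated from its data values (step 802). The median may be used here, or the value may be obtained from the mean of the logarithms of the data values remaining after the upper and lower extremes are removed. Next, the background value of the row is set (step 803). The background value to be set is assumed to be proportional to the characteristic value of central tendency obtained in step 802. That is, for the characteristic value of central tendency M_i of a row (i being the row number), the background value is taken to be αM_i.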
【００６５】
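Next, the data values from which the background value has been subtracted are each converted to logarithms (steps 804, 805). A data value that is less than or equal to the background value cannot be converted to a logarithm. Such data are preferably displayed on the screen of the display device as being below the measurement limit. Then the characteristic value of central tendency M_i, or the characteristic value of central tendency minus the background value, is subtracted from the log values (step 806). Further, a characteristic value of variation (second characteristic value) is set for the subtracted values, and the subtracted values are divided by the second characteristic value (step 807). The characteristic value of variation (second characteristic value) σ is preferably chosen, for example, by creating a graph with the corresponding standard values on the x-axis and the sorted quotients on the y-axis and selecting the value for which a certain range of the plotted points (for example, the top 60% to 90%) most closely approximates y = x.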
【００６６】
In other words, steps 801 to 808 calculate, for a row i,
(log(X - αM_i) - M_i) / σ
This processing is executed for each row (steps 809, 810), and the resulting data values are temporarily stored in the data buffer 30.
【００６７】
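The temporarily stored data values are then sorted and compared with the corresponding standard values (steps 901, 902). Here too, a graph is generated with the corresponding standard values on the x-axis and the sorted data values on the y-axis, and it is judged whether the plotted points approximate y = x. If the approximation is adequate (Yes in step 903), the background value (αM_i), the characteristic value of central tendency (M_i), and the characteristic value of variation (σ) of each row are stored in the result storage part 42 (step 904). Adequacy may be judged from the sum of the squared differences (squared errors) between the corresponding standard values and the data values, or from the sum of the absolute differences.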
【００６８】
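If the difference exceeds a predetermined range (that is, if the line connecting the plotted points deviates from y = x by more than a predetermined amount) (No in step 903), the proportionality constant α is changed, the characteristic value of variation σ is changed accordingly, and the processing from step 801 is executed again. According to this embodiment, taking αM_i as the background value for the i-th row of the DNA chip and M_i as its characteristic value of central tendency makes it possible to eliminate chip manufacturing unevenness that gives each row its own distinctive values.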
【００６９】
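Next, the third embodiment of the present invention is described. In the first embodiment, the background value (γ), the characteristic value of central tendency (μ), and the characteristic value of variation (σ) are calculated from the data values actually obtained from the DNA chip (measured values). There are cases, however, in which the noise affecting the median cannot be ignored. For example, a faulty wet experiment can raise the noise level of the hybridization as a whole. Once the noise level approaches the median, even the robust method of the first embodiment becomes difficult to apply. Here, noise means the component of an individual data value that arises by chance; measurement error and error in the spotted amount are thought to be its causes. Noise is the counterpart of signal, and the raw data obtained from a DNA chip can be regarded as the sum of noise and signal. Background can be defined as the part of the signal in an individual data value that does not derive from the RNA in the sample. The signal can therefore be regarded as the sum of the RNA-derived part and the background.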
【００７０】
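As noted above, even when the noise level is high, the top-ranked data are, by the nature of the log-normal distribution, sufficiently larger than the noise level. These data should become analyzable if the appropriate, true characteristic value of central tendency can be found. If the background can be obtained, that characteristic value can also be found by trial and improvement. The relationship between the background and the characteristic value of central tendency is unknown, however. Finding two of the three parameters of the log-normal distribution introduced in the present invention by that approach is difficult because of the amount of computation and the problem of choosing among solutions that are not uniquely determined.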
【００７１】
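Accordingly, when the combination of chip and sample is expected to yield a log-normal distribution in the absence of noise, the values can be obtained by the following method. FIG. 10 is a flowchart showing the processing according to the third embodiment. In the third embodiment, as shown in FIG. 10, the original data for a DNA chip are acquired from the data buffer 30 (step 1001) and sorted into ascending or descending order (step 1002). The sorted data are also stored in the data buffer 30. Next, the value Z_i (i = 1, 2, ...) expected under the ideal log-normal distribution is assigned to each sorted data value (step 1003). This ideal value Z_i can be obtained in almost the same way as the standard values in the first embodiment (see step 404). To recap briefly, the following m(i) is calculated first:
m(i) = (i - 0.3175) / (n + 0.365)
where n is the number of data values and i is a natural number from 1 to n.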
【００７２】
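Next, the inverse F^-1 of the normal distribution function is applied to each m(i) obtained. Each resulting value F^-1(m(i)) is the Z_i corresponding to a data value. These standard values are also stored in the data buffer 30 for use in later processing. Once the ideal values Z_i have been obtained, each Z_i is multiplied by the expected characteristic value of variation (s) (step 1004). Because the characteristic value of variation can be assumed not to vary from experiment to experiment, it can be predicted to some extent.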
【００７３】
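Next, a graph is generated with 10 raised to the power of the product (that is, 10^(s*Z_i)) on the x-axis and the measured value x_i on the y-axis (step 1005). The straight portion of this graph can be regarded as the reliable region (confidence region). When, for example, the user refers to the displayed graph and selects the straight portion (specifies its range) (step 1006), the intercept and slope of the graph are calculated (step 1007). The logarithm of the slope is stored as the characteristic value of central tendency (u), and the intercept as the background value (g).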
【００７４】
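The validity of the characteristic value of central tendency and the background obtained in this way is explained briefly below. In the three-parameter standardization (Z standardization) according to the present invention, Z_i can be expressed as
Z_i = {log(x_i - g) - u} / s
where Z_i is the ideal value, x_i is the corresponding measured value, and g, u, and s are the background value, the characteristic value of central tendency, and the characteristic value of variation, respectively.
Solving this expression for x_i gives
x_i = (10^u) * (10^(s*Z_i)) + g
Plotting 10^(s*Z_i) on the x-axis against x_i on the y-axis therefore yields a line that is straight over a certain range. Because 10^u is the slope of this line, taking the base-10 logarithm of the slope gives the characteristic value of central tendency u. The background value g, the characteristic value of central tendency u, and the characteristic value of variation s obtained as described above are each stored in the result storage part 42 (step 1008).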
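The relation x_i = (10^u) * (10^(s*Z_i)) + g suggests a simple numerical sketch: regress the sorted measured values on 10^(s*Z_i) over the range judged linear, then read u from the base-10 logarithm of the slope and g from the intercept. The threshold `z_min` standing in for the user's range selection, the function names, and the synthetic parameters (u = 2, s = 0.78, g = 50) are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def standard_values(n):
    normal = NormalDist()
    return [normal.inv_cdf((i - 0.3175) / (n + 0.365)) for i in range(1, n + 1)]


def fit_line(xs, ys):
    """Ordinary least-squares line y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx


def fit_u_and_g(values, s, z_min):
    """Fit x = 10**u * 10**(s*Z) + g on points with Z >= z_min; returns (u, g)."""
    data = sorted(values)                       # ascending, matching ascending Z_i
    z = standard_values(len(data))
    points = [(10 ** (s * zi), xi) for zi, xi in zip(z, data) if zi >= z_min]
    slope, intercept = fit_line([p[0] for p in points], [p[1] for p in points])
    return math.log10(slope), intercept


# Noise-free synthetic data built from the same relation.
z = standard_values(100)
synthetic = [10 ** 2.0 * 10 ** (0.78 * zi) + 50.0 for zi in z]
u, g = fit_u_and_g(synthetic, s=0.78, z_min=0.7)
print(round(u, 3), round(g, 3))  # 2.0 50.0
```

With noisy data the low end of the plot departs from the line, which is why only the upper range is used for the fit.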
【００７５】
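As noted earlier, the third embodiment is applied to data whose noise level is so high that analysis by the robust method is difficult. The range (lower limit) of usable data values is therefore calculated as follows: in the graph obtained in step 1005, with 10^(s*Z_i) on the x-axis and x_i on the y-axis, the range (or lower limit) over which linearity is maintained is found (step 1009). The lower limit determined in this way is also stored in the result storage part 42. FIG. 17 shows an example of a graph in which values are plotted with 10^(s*Z_i) on the x-axis and x_i on the y-axis for data derived from a certain DNA. In FIG. 17, for data spotted with 12 pins, each of the 12 groups of data values forms one graph. Here linearity is lost at about 10^(s*Z_i) = 3.5. Since s ≈ 0.78 in this example, the lower limit of Z_i was found to be about 0.7.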
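The lower limit of about 0.7 follows from solving 10^(s*Z_i) = 3.5 for Z_i with s ≈ 0.78. The short check below reproduces that arithmetic; the function name is hypothetical.

```python
import math


def z_lower_limit(x_axis_limit, s):
    """Solve 10**(s*Z) = x_axis_limit for Z."""
    return math.log10(x_axis_limit) / s


print(round(z_lower_limit(3.5, 0.78), 2))  # 0.7
```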
【００７６】
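Next, the data values whose assigned ideal values are within the range (that is, at or above the lower limit) are extracted. Those outside the range are preferably displayed on the screen of the display device as being below the measurement limit. The extracted ideal values, meanwhile, are taken as the standardized data values (step 1010).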
【００７７】
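According to the third embodiment, even when the noise level is high and the method of the first embodiment cannot be applied, the data can be standardized on the premise that they follow a log-normal distribution. The lower limit of usable data values can also be identified.
The present invention is not limited to the embodiments above; various modifications are possible within the scope of the invention set out in the claims, and these are of course also encompassed within the scope of the present invention.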
【００７８】
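For example, the initial correction is not limited to what is described above. FIG. 11 is a flowchart showing another example of the initial correction. The example shown in FIG. 11 is likewise used to eliminate row- or column-specific trends in the data. Here, for each row, the background value is determined based on the characteristic value of central tendency of the row (see steps 1101 to 1103), and the value obtained by subtracting the set background value from each data value is converted to a logarithm (step 1104). The characteristic value of central tendency is then subtracted from the log value (step 1105). Here too, the median of the data values of each row may be used as the characteristic value of central tendency, or the mean of the data values remaining after the upper and lower extremes are removed may be used. The background value is preferably the characteristic value multiplied by a proportionality constant. Executing this processing to the last row (see steps 1106 and 1107) can be regarded as resolving the manufacturing unevenness of the chip.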
【００７９】
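In the embodiments above, data acquired from a DNA chip are processed to obtain data that can be analyzed, for example by comparison, but the invention is not limited to DNA chips and is also applicable to so-called protein chips. That is, the present invention can also be applied to data obtained by labeling the crude protein in a sample and applying it to an antibody chip.
Furthermore, the present invention is not limited to DNA chips and protein chips; it can be applied in the same way to data representing gene expression levels acquired by any technique, such as data acquired from microbeads on which genes such as DNA are immobilized.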
【００８０】
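As the DNA chip supplying the data to which the data processing method of the present invention is applied, it is desirable to use one in which the spot positions of the cDNA clones are random, independent of the origin of the clones and the strength of their expression. When clones derived from a single tissue, or only a limited variety of clones, are spotted, it is desirable to also spot several kinds of randomly selected clones as controls for measuring the characteristic value of central tendency (and the characteristic value of variation) of the data.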
【００８１】
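[Effects of the Invention]
According to the present invention, it is possible to provide a data processing method that enables more accurate analysis of data obtained from a DNA chip.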
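[Brief Description of the Drawings]
[FIG. 1] FIG. 1 is a hardware configuration diagram of the analysis apparatus according to the first embodiment of the present invention.
[FIG. 2] FIG. 2 is a functional block diagram of the main part of the analysis apparatus according to the first embodiment.
[FIG. 3] FIG. 3 is a flowchart outlining the processing performed by the analysis apparatus according to the present embodiment.
[FIG. 4] FIG. 4 is a flowchart showing in more detail the background value calculation according to the first embodiment.
[FIG. 5] FIG. 5 is a flowchart showing the parameter calculation according to the present embodiment.
[FIG. 6] FIG. 6 is a flowchart showing an example of the initial correction according to the present embodiment.
[FIG. 7] FIG. 7 is a flowchart showing an example of the initial correction according to the present embodiment.
[FIG. 8] FIG. 8 is a flowchart outlining the processing according to the second embodiment.
[FIG. 9] FIG. 9 is a flowchart outlining the processing according to the second embodiment.
[FIG. 10] FIG. 10 is a flowchart schematically showing the processing executed by the analysis apparatus according to the third embodiment.
[FIG. 11] FIG. 11 is a flowchart showing another example of the initial correction according to the present invention.
[FIG. 12] FIG. 12 is a graph showing an example of the difference index for each background candidate value.
[FIG. 13] FIG. 13 is a graph showing an example of the difference index for each background candidate value.
[FIG. 14] FIG. 14 is an example of a graph in which values are plotted with the ideal (theoretical) values on the horizontal axis and the data values based on the measured values on the vertical axis.
[FIG. 15] FIG. 15 is another example of a graph in which values are plotted with the ideal (theoretical) values on the horizontal axis and the data values based on the measured values on the vertical axis.
[FIG. 16] FIG. 16 is a graph showing the data value of each spot and the moving average for data acquired from a DNA chip.
[FIG. 17] FIG. 17 shows an example of a graph in which values are plotted with 10^(s*Z_i) on the x-axis and x_i on the y-axis for data derived from a certain DNA.
[Explanation of Reference Numerals]
10  Analysis apparatus
30  Data buffer
32  Background candidate computing part
34  Preprocessing part
36  Transformation/standardization processing part
38  Difference computing/comparing processing part
40  Image forming processing part
42  Result storage part
44  Data correction part
46  Sorting/sampling processing part
[0001]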
[Industrial technical field]
The present invention relates to a technique for statistically analyzing gene expression data.
[0002]
[Prior art]
The use of DNA chips to acquire gene expression data is known. A DNA chip is a substrate, such as a glass slide, on which a plurality of genes are immobilized as separate spots. A microarray, for example, carries thousands to tens of thousands of genes immobilized as targets. Single-stranded DNA, mRNA, or the like is used as the target.
[0003]
Various substrates capable of holding nucleic acids, such as a plate made of glass with various coatings, a film made of nylon or nitrocellulose, a hollow fiber, a semiconductor material, a metal material, an organic substance, or the like can be used as a DNA chip substrate. In addition, as a target, a product obtained by duplicating all or part of cDNA, a product obtained by duplicating a part of genomic DNA, synthetic DNA and / or synthetic RNA can be used. In order to fix the target to the base material, a method of synthesizing oligo DNA on a glass plate by a photolithographic method and a method of attaching the target to the base material using a spotter or the like are known.
[0004]
A DNA chip of this kind is hybridized with, for example, fluorescently labeled DNA or RNA (the analysis target). Analysis targets complementary to a target form a double strand with it. Because the analysis target is fluorescently labeled, image data can be acquired after hybridization by scanning the DNA chip with a fluorescence scanner. From the image data acquired in this way, it is possible to know at which spots double strands have formed. More specifically, the image obtained shows the spots derived from each DNA as a result of hybridization. Integrating the signal intensity over a predetermined region containing the position of each spot therefore yields array data consisting of values that indicate the signal intensity of each spot.
[0005]
For example, a microarray on which thousands to tens of thousands of targets are immobilized yields array data showing the expression of a large number of genes in a single experimental operation. As a result, when measuring the increase or decrease in the expression data of a single gene, it is common to calculate the mean of the data for the many expressed genes (the values indicating signal intensity) as the reference and to standardize the data on that basis. More specifically, the data are standardized before the expression data of different experiments are compared. An example of such standardization is disclosed in Johannes Schuchhardt et al., "Normalization strategies for cDNA microarrays" (Nucleic Acids Research (2000) Vol. 28 No. 10).
[0006]
[Problems to be solved by the invention]
The probability distribution of the acquired data is nonparametric. Nevertheless, as disclosed in, for example, Todd Richmond et al., "Chasing the dream: plant EST microarrays" (Current Opinion in Plant Biology (2000) Vol. 3 pp. 108-116), the acquired data are standardized by methods such as Z-standardization, t-standardization, or dividing the integrated signal intensity of each spot by the arithmetic mean of all the values.
[0007]
Since these are not non-parametric methods, there is a problem that such standardization significantly impairs data accuracy.
Moreover, array data based on an image acquired by a fluorescence scanner always contain a background component. This arises from the background signal intensity present throughout the image data and from the fact that the measurement area does not necessarily match the size and shape of the actual spot. For accurate analysis it is therefore important to subtract the background component from the values of the acquired image data and obtain data consisting of true signal values. The same applies to array data acquired by other methods, such as detection of electrical signals or detection of radiation.
[0008]
Conventionally, the background component has been estimated by obtaining the per-pixel mean or median from values representing the signal intensity of specific spots or unspotted areas and multiplying it by the number of pixels in the measurement area.
Alternatively, as proposed by Michael Eisen in the "ScanAlyze User Manual" (http://rana.lbl.gov/EisenSoftware.htm), a method is also known that estimates the background component for each spot from the values just outside its measurement area.
However, in the conventional correction method, the estimated value of the background changes due to differences in spots used for calculating the background value and areas in the image. That is, there is a possibility that various background values may be estimated from the above differences, and it is impossible to determine which one is appropriate. In particular, the difference in the background value may be large between the area where DNA is spotted and the area where DNA is not.
[0009]
The present inventor therefore found that the logarithmic values of data obtained from a DNA chip (data indicating the amount of luminescence due to gene expression) follow a three-parameter normal distribution, and proposed logarithmically transforming the data and then standardizing them (for example, by z-standardization). This method made it possible to accurately compare the results of different experiments as well as of experiments of the same kind.
Another object of the present invention is to provide a data processing method capable of performing more accurate analysis based on gene expression data obtained from a DNA chip or the like.
[0010]
[Means for Solving the Problems]
The object of the present invention is achieved by a method of processing gene expression data in which array data obtained based on gene expression levels, for example array data composed of values indicating the signal intensity of each spot arranged on a chip such as a hybridized DNA chip or protein chip, are processed to obtain analyzable data, the method comprising the steps of: acquiring the array data; sorting the data values of the acquired array data, extracting a predetermined number of data values at predetermined intervals from the sorted data values, and temporarily storing them in storage means; selecting a plurality of background candidates and temporarily storing them in the storage means; subtracting the value of each background candidate from each of the extracted data values to obtain subtraction values, logarithmically transforming each subtraction value to obtain logarithmic values, and temporarily storing the logarithmic values in the storage means; calculating a standard value of the normal distribution corresponding to each of the logarithmic values; calculating, for each background candidate, an index indicating the difference between the logarithmic values and the standard values; narrowing the range of background candidate values based on the index; determining a background value by repeating the acquisition of the subtraction values and logarithmic values, the calculation of the index indicating the difference, and the narrowing of the background candidate values; and standardizing each of the logarithmic values temporarily stored in relation to the determined background value, and storing the standardized values in the storage means.
[0011]
According to the present invention, based on the difference between the logarithmic value of the sorted value and the corresponding standard value, the background value that minimizes the difference is determined, so a more appropriate background value is determined. As a result, the data to be analyzed including comparison with other data can be made more appropriate.
[0012]
As the index of the difference, the sum of the absolute values of the differences, the sum of the squares of the differences (squared error), “r” of the least-squares method, and the like can be used. The predetermined interval at which the predetermined number of data values is extracted from the sorted data values may be “0”, in which case all the data are extracted. The standard value corresponding to the i-th of the n extracted data values may be the i-th n-quantile of the normal distribution.
[0013]
Another object of the present invention is achieved by a method of processing gene expression data in which array data obtained based on gene expression levels are processed to obtain analyzable data, the method comprising the steps of: acquiring the array data; sorting the data values of the acquired array data, extracting a predetermined number of data values at predetermined intervals from the sorted data values, and temporarily storing them in storage means; determining a background value γ and storing it in the storage means; logarithmically transforming the data values from which the background value has been subtracted to obtain logarithmic values, and temporarily storing them in the storage means; calculating, with reference to the logarithmic values, a characteristic value μ of the central tendency and a characteristic value σ of the variation, and storing them in the storage means; and calculating, for each data value x, z = (log (x−γ) −μ) / σ as the standard value z, and storing each calculated standard value z in the storage means.
[0014]
According to the present invention, each data value x of the array data is standardized to z = (log (x−γ) −μ) / σ using the calculated parameters γ, μ, and σ, so that values suitable for further analysis can be obtained.
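The standardization z = (log (x−γ) −μ) / σ can be illustrated with a minimal sketch. The function name `standardize` is hypothetical, and the use of a base-10 logarithm is an assumption (the text writes only "log"). Values at or below the background γ have no logarithm and are returned as `None` here, which is also an illustrative choice.

```python
import math


def standardize(values, gamma, mu, sigma):
    """Return z = (log10(x - gamma) - mu) / sigma for each data value x.

    Base-10 logarithm is an assumption. Values with x - gamma <= 0 cannot be
    log-transformed and are reported as None.
    """
    result = []
    for x in values:
        if x - gamma > 0:
            result.append((math.log10(x - gamma) - mu) / sigma)
        else:
            result.append(None)  # logarithm undefined for this spot
    return result
```

For example, with γ = 10, μ = 2 and σ = 1, a data value of 1010 gives log10(1000) = 3 and therefore z = 1.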
In a preferred embodiment, the step of determining the background value γ comprises the steps of: selecting a plurality of background candidates and temporarily storing them in the storage means; subtracting the value of each background candidate from each of the extracted data values to obtain subtraction values, logarithmically transforming each subtraction value to obtain logarithmic values, and temporarily storing the logarithmic values in the storage means; calculating a standard value of the normal distribution corresponding to each of the logarithmic values; calculating, for each background candidate, an index indicating the difference between the logarithmic values and the standard values; and narrowing the range of background candidate values based on the index. The background value is determined by repeating the acquisition of the subtraction values and logarithmic values, the calculation of the index indicating the difference, and the narrowing of the background candidate values.
[0015]
In a more preferred embodiment, the step of calculating the characteristic value μ of the central tendency and the characteristic value σ of the variation comprises the steps of: calculating a standard value corresponding to each of the logarithmic values; comparing the logarithmic values with the standard values and obtaining the range in which the ratio of the two is substantially constant; calculating the slope and y-intercept of the straight line formed in that range when the standard values are taken as the x-axis and the logarithmic values as the y-axis; and determining the calculated y-intercept as the characteristic value μ of the central tendency and the slope as the characteristic value σ of the variation. Here, a so-called normal probability plot (NPP) is used to find a region in which linearity is ensured, and the slope and intercept of the straight line derived from that region are determined as σ and μ, respectively. As a result, more robust standardization can be realized.
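The derivation of μ and σ from the linear region of the normal probability plot can be sketched as follows. This is an illustration under assumptions: the linear region is taken as already known and passed in as an index range, and the straight line is fitted by ordinary least squares, which the text does not prescribe. The function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mu_sigma_from_linear_range(standard_values, log_values, lo, hi):
    """Fit the NPP line over indices lo..hi-1 (assumed to be the linear region).

    x-axis: standard values, y-axis: logarithmic values.
    The intercept is taken as mu and the slope as sigma.
    """
    slope, intercept = fit_line(standard_values[lo:hi], log_values[lo:hi])
    return intercept, slope
```

Restricting the fit to the linear region means that points distorted by noise, typically the weakest signals, do not influence μ and σ.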
[0016]
In another preferred embodiment, the method further comprises the steps of: rearranging the data values in the order of the spots arranged on the chip and temporarily storing them in the storage means in that order; calculating, for a column or row in which the spots are arranged on the chip, an index indicating the tendency of the data values of that column or row; calculating, when the index shows that each column or row has a distinctive feature, the median of the data for each column or each row; and dividing each data value by the corresponding median to obtain a division value and temporarily storing it in the storage means. The stored division values are then used as the calculation targets, as values corresponding to the data values of the array data.
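The median division can be sketched as below. This is a simplified illustration: the chip is assumed to be given as a list of rows, the correction is applied per column, and the preceding step of deciding whether a column-wise feature exists is omitted. The function name is hypothetical.

```python
from statistics import median


def divide_by_column_median(grid):
    """Divide every data value by the median of its column.

    grid is a list of rows in chip order; all columns are corrected here,
    whereas the text applies the correction only when a feature is detected.
    """
    columns = list(zip(*grid))
    medians = [median(col) for col in columns]
    return [[value / medians[j] for j, value in enumerate(row)] for row in grid]
```

For a per-row correction, the same function can be applied to the transposed grid.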
[0017]
According to this embodiment, even when a column or row has peculiar characteristics because of a problem with the accuracy of the array chip, for example the accuracy of the stamping machine or the origin of the clones placed in the spots of the chip itself, this peculiarity can be eliminated and robust standardization can be performed.
The step of calculating the index indicating the tendency may include a step of calculating a moving average related to a specific column or row.
[0018]
In another preferred embodiment, the method further comprises the steps of: rearranging the data values in the order of the spots arranged on the chip and temporarily storing them in the storage means in that order; finding a periodicity in the data values in that order; and, when such a periodicity exists, subtracting the characteristic value of the central tendency of the cycle from each data value and temporarily storing the result in the storage means. The temporarily stored subtraction values are then used as the calculation targets, as values corresponding to the data values of the array data. When the values of the array data have a certain periodicity, eliminating the periodic element yields data that are more appropriate as an analysis target.
[0019]
In another embodiment, the method comprises the steps of: rearranging the data values in the order of the spots arranged on the chip; calculating, for each column or row in which the spots are arranged on the chip, a characteristic value of the central tendency of the data values; setting a background value for the spots belonging to the column or row based on the characteristic value of the central tendency; subtracting the background value from each of the data values of the spots to calculate subtraction values; logarithmically transforming the subtraction values to obtain logarithmic values; and subtracting the characteristic value of the central tendency of the logarithmic values for the column or row, and temporarily storing the resulting subtraction values in storage means. The temporarily stored subtraction values are then used as the calculation targets, as values corresponding to the data values of the array data.
[0020]
Furthermore, the object of the present invention is also achieved by a method of processing gene expression data in which array data obtained based on gene expression levels are processed to obtain analyzable data, the method comprising the steps of: calculating, for each column or row in which spots are arranged on the chip, a characteristic value of the central tendency of the data values; setting, based on the characteristic value of the central tendency, background value candidates for the spots belonging to the column or row; subtracting the background candidate value from each of the data values of the spots to calculate subtraction values; logarithmically transforming each of the subtraction values to obtain logarithmic values; calculating a characteristic value of the central tendency of the logarithmic values for the column or row and subtracting it from each of the logarithmic values to calculate second subtraction values; dividing the data values by a characteristic value of the variation calculated for the column or row based on the second subtraction values to obtain division values, and temporarily storing them in storage means; comparing the division values with the corresponding standard values and determining, as the background value γ, the background candidate value that minimizes an index of the difference between them; and storing in the storage means the characteristic value μ of the central tendency and the characteristic value σ of the variation associated with the background value γ.
[0021]
According to the present invention, the background value is determined based on the characteristic value of the central tendency for each column or row. For example, the background value for each column can be considered as a proportional constant multiple of the characteristic value of the central tendency of the column. This makes it possible to eliminate column and row specificity.
[0022]
Another object of the present invention is also achieved by a method of processing gene expression data in which array data obtained based on gene expression levels are processed to obtain analyzable data, the method comprising the steps of: acquiring the array data; sorting the data values of the acquired array data and temporarily storing the sorted data in storage means; calculating standard values of the normal distribution corresponding to the sorted data values; setting a characteristic value s of the variation for the data values, storing it in the storage means, and multiplying each of the standard values by it to obtain multiplication values; comparing the data values with the multiplication values to obtain a range in which the ratio of the two is constant; calculating the slope and y-intercept of the straight line formed in that range when the multiplication values are taken as the x-axis and the logarithmic values as the y-axis; and determining the natural logarithm of the slope as the characteristic value u of the central tendency and the intercept as the background value g, and storing them in the storage means.
[0023]
For example, even when the noise level of the entire hybridization is too high to be ignored, for instance because of a failed wet experiment, standardization using the above technique can be applied, provided that a lognormal distribution would be expected from the combination of chip and sample in the absence of noise.
Here, it is desirable to further include a step of solving xi = (10^u) * (10^(s * Zi)) + g (where Zi is the i-th standard value) for xi, temporarily storing the result in the storage means, obtaining the lower limit of the values that can be used as xi, and storing this in the storage means. This makes it possible to know the range of data that can be used as an analysis target.
[0024]
Another object of the present invention is achieved by a computer-readable program for operating a computer so as to process array data obtained based on gene expression levels and obtain analyzable data, the program causing the computer to execute the steps of: acquiring the array data; sorting the data values of the acquired array data, extracting a predetermined number of data values at predetermined intervals from the sorted data values, and temporarily storing them in storage means; selecting a plurality of background candidates and temporarily storing them in the storage means; subtracting the value of each background candidate from each of the extracted data values to obtain subtraction values, logarithmically transforming each subtraction value to obtain logarithmic values, and temporarily storing the logarithmic values in the storage means; calculating a standard value of the normal distribution corresponding to each of the logarithmic values; calculating, for each background candidate, an index indicating the difference between the logarithmic values and the standard values; narrowing the range of background candidate values based on the index; determining a background value by repeating the acquisition of the subtraction values and logarithmic values, the calculation of the index indicating the difference, and the narrowing of the background candidate values; and standardizing each of the logarithmic values temporarily stored in relation to the determined background value, and storing the standardized values in the storage means.
[0025]
Furthermore, the object of the present invention is also achieved by a computer-readable program for operating a computer so as to process array data obtained based on gene expression levels and obtain analyzable data, the program causing the computer to execute the steps of: acquiring the array data; sorting the data values of the acquired array data, extracting a predetermined number of data values at predetermined intervals from the sorted data values, and temporarily storing them in storage means; determining a background value γ and storing it in the storage means; logarithmically transforming the data values from which the background value has been subtracted to obtain logarithmic values, and temporarily storing them in the storage means; calculating, with reference to the logarithmic values, a characteristic value μ of the central tendency and a characteristic value σ of the variation, and storing them in the storage means; and calculating, for each data value x, z = (log (x−γ) −μ) / σ as the standard value z, and storing each calculated standard value z in the storage means.
[0026]
Alternatively, the object of the present invention is also achieved by a computer-readable program for operating a computer so as to process array data obtained based on gene expression levels and obtain analyzable data, the program causing the computer to execute the steps of: calculating, for each column or row, a characteristic value of the central tendency of the data values; setting, based on the characteristic value of the central tendency, background value candidates for the spots belonging to the column or row; subtracting the background candidate value from each of the data values of the spots to calculate subtraction values; logarithmically transforming the subtraction values to obtain logarithmic values; calculating a characteristic value of the central tendency of the logarithmic values for the column or row and subtracting it from each of the logarithmic values to calculate second subtraction values; dividing the data values by a characteristic value of the variation calculated for the column or row based on the second subtraction values to obtain division values, and temporarily storing them in storage means; comparing the division values with the corresponding standard values and determining, as the background value γ, the background candidate value that minimizes an index of the difference between them; and storing in the storage means the background value γ, and the characteristic value μ of the central tendency and the characteristic value σ of the variation associated with the background value γ.
[0027]
Another object of the present invention is also achieved by a computer-readable program for operating a computer so as to process array data obtained based on gene expression levels and obtain analyzable data, the program causing the computer to execute the steps of: acquiring the array data; sorting the data values of the acquired array data and temporarily storing the sorted data in storage means; calculating standard values of the normal distribution corresponding to the sorted data values; setting a characteristic value s of the variation for the data values, storing it in the storage means, and multiplying each of the standard values by it to obtain multiplication values; comparing the data values with the multiplication values to obtain a range in which the ratio of the two is constant; calculating the slope and y-intercept of the straight line formed in that range when the multiplication values are taken as the x-axis and the logarithmic values as the y-axis; and determining the natural logarithm of the slope as the characteristic value u of the central tendency and the intercept as the background value g, and storing them in the storage means.
[0028]
As the substrate of the DNA chip, anything that can retain nucleic acids on its surface can be used, such as plates made of glass with various coatings, membranes based on nylon or nitrocellulose, hollow fibers, semiconductors, metals, and organic substances. On the DNA chip, all or part of a cDNA, genomic DNA, synthetic DNA, synthetic RNA, and the like are arranged as targets.
[0029]
Methods of producing a chip include preparing nucleic acids and arranging them on a substrate by adsorption, electrostatic binding, or covalent bonding, and synthesizing nucleic acids on the substrate. Methods of detecting the signal indicating the signal intensity include an electrical method using a semiconductor chip and methods of detecting fluorescence or radioactivity.
[0030]
The present invention can be applied to array data from a DNA chip in which any of the above targets is formed on any of the above-mentioned substrates. Further, the present invention can be applied to array data acquired using any method. The same applies to data obtained from other media, such as microbeads on which genes such as DNA are immobilized.
[0031]
In this specification, the term DNA chip includes any chip in which nucleic acids are arranged on a substrate, such as an RNA chip in which RNA is formed on a substrate, a microarray, a macroarray, a dot blot, and a reverse Northern blot.
[0032]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. FIG. 1 is a hardware configuration diagram of an analysis apparatus according to the first embodiment of the present invention. As shown in FIG. 1, the analysis device 10 includes a CPU 12, an input device 14 such as a mouse and a keyboard, a display device 16 including a CRT, a RAM (Random Access Memory) 18, a ROM (Read Only Memory) 20, a portable storage medium driver 22 for accessing a portable storage medium 23 such as a CD-ROM or a DVD-ROM, a hard disk device 24, and an interface (I / F) 26 for controlling data exchange with the outside. As can be understood from FIG. 1, a personal computer or the like can be used as the analysis apparatus 10 according to the present embodiment.
[0033]
The I / F 26 is connected to a reader or scanner (not shown), which measures the light emission amount of the spots on the hybridized DNA chip and generates data based on the measured light emission amount, or to a communication circuit. The communication circuit is further connected to an external network (for example, the Internet).
In the present embodiment, the portable storage medium 23 stores a program for receiving data from the reader or scanner and performing the necessary data conversion processing, described later, on the data, and a program for analyzing the processed data. The portable storage medium driver 22 reads the programs from the portable storage medium 23, stores them in the hard disk device 24, and starts them, whereby the personal computer can operate as the analysis device 10. Alternatively, the programs may be downloaded via an external network such as the Internet.
[0034]
FIG. 2 is a functional block diagram of the main part of the analysis apparatus 10 according to the first embodiment. FIG. 2 shows the components that execute processing for deriving analysis results of gene expression data. As shown in FIG. 2, the analysis device 10 includes: a data buffer 30; a background candidate calculation unit 32 that, based on the data (original data) temporarily stored in the data buffer 30, calculates background value candidates corresponding to the noise component in the light emission amount of the spots on the DNA chip; a pre-processing unit 34 that performs predetermined pre-processing on the original data before an operation between the background candidate values and the original data is performed; a conversion / standardization processing unit 36 that performs a conversion, described later, on the data that have been subjected to the operation and standardizes the converted data; a difference calculation / comparison processing unit 38 that calculates the difference between the standardized values and ideal values, compares the differences for the plurality of background candidates, and calculates a correction value of the graph based on the comparison result; an image forming unit 40 that forms an image to be presented to the user; and a result storage unit 42 that stores the various data obtained.
[0035]
The pre-processing unit 34 includes a data correction unit 44 that, when the original data have a regularity depending on the arrangement or position (region) on the DNA chip, performs processing for increasing the randomness as necessary, and a sort / extraction processing unit 46 that sorts the data corrected by the data correction unit 44 and extracts predetermined data from the sorted data group.
[0036]
The function of the data buffer 30 is realized by the RAM 18 and, in some cases, the hard disk device 24. The data buffer temporarily stores the data indicating the light emission amount of each spot, either as transmitted from the reader or scanner, or as transmitted from the reader or scanner and stored in advance in a predetermined area of the hard disk device 24. The data buffer 30 can also temporarily store the background candidate values calculated by the background candidate calculation unit 32, the data processed by the pre-processing unit 34, and, in some cases, the logarithmically transformed data and the standard values or ideal values used in the calculations.
[0037]
The reader or scanner images the DNA chip with a CCD camera or the like, integrates the signal intensity for each spot, and outputs the result as array data. Alternatively, in some cases the reader or scanner determines a background value from the image data values of the image captured by the CCD camera, subtracts the background value from the signal intensity of each pixel, integrates the signal intensity for each spot from the background-corrected data, and outputs the result as array data. In the present embodiment, either unprocessed array data or data that have been subjected to correction processing (background correction) by the reader, the scanner, or the accompanying software can be used. In this specification, the data integrated from the signal for each spot and transmitted from the reader or scanner are referred to as array data, or, as the data on which the background processing according to the present embodiment is based, as original data.
[0038]
The processing by which the analysis apparatus 10 calculates, from the data indicating the light emission amount appearing on the DNA chip, an index that can be compared with other data will be described in detail below. FIG. 3 is a flowchart showing an outline of the processing by the analysis apparatus 10 according to the present embodiment. As shown in FIG. 3, the analysis apparatus 10 first acquires original data relating to a certain DNA chip from the data buffer 30 (step 301) and performs preprocessing on them (see step 310). In the present embodiment, the preprocessing includes an optional initial correction process (step 302) executed as necessary based on the state of the original data, a sorting process for the acquired original data (step 303), and extraction of the data values located at predetermined positions in the sorted data group (step 304). The initial correction process will be described in detail later.
[0039]
The sort / extraction processing unit 46 in the pre-processing unit 34 extracts the data located at predetermined intervals from the data whose values have been rearranged in ascending or descending order by the sorting process. For example, values at predetermined positions may be extracted at predetermined intervals, such as the 10th, 20th, 30th, and so on. Alternatively, predetermined quantiles may be extracted, such as the first percentile, the second percentile, and so on. The sorted data and the extracted data are stored in a predetermined area of the data buffer 30.
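The sorting and extraction can be sketched as follows. This is a minimal illustration with a hypothetical function name; ascending order is chosen here, and `step=1` returns every value.

```python
def sort_and_extract(values, step):
    """Sort ascending and return the step-th, 2*step-th, 3*step-th, ... values."""
    ordered = sorted(values)
    # Index step-1 is the step-th smallest value; then take every step-th one.
    return ordered[step - 1::step]
```

With `step=10`, the 10th, 20th, 30th, and subsequent values of the sorted data are returned.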
[0040]
A background value is then calculated (step 305), and the other parameters are calculated (step 306). The present embodiment is based on the findings that the logarithmic values of data obtained from a DNA chip (data indicating the amount of luminescence due to gene expression) are normally distributed, and that z-standardizing them allows the results of different experiments, or of experiments of the same kind, to be compared accurately. On this basis, a more robustly standardized data group is obtained from the data of a given DNA chip.
[0041]
Here, in the present embodiment, in z = (log (x−γ) −μ) / δ, γ is the calculated background value, and the remaining parameters μ and δ are calculated by an operation described later. First, the calculation of the background value will be described in more detail, and then the calculation of the remaining parameters will be described in detail.
[0042]
FIG. 4 is a flowchart showing the background value calculation process (step 305) in more detail. The background candidate calculation unit 32 determines a range of background value candidates (background candidate values) and a plurality of background candidate values in the range in accordance with an input by an operator's operation of an input device or the like. For example, if the user specifies a starting point (for example, “0 (zero)”) and an ending point (for example, the median or the first quartile) of the background candidate value, between the starting point and the ending point A predetermined number of values that are equally spaced (or equivalent) are determined. For example, when “0” and the median value are designated, eight values are taken at equal intervals between them, and ten background candidate values including the start point and the end point are determined. In this process, the background candidate value is stored in the data buffer 30, and the value is read and updated as necessary.
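The choice of equally spaced background candidate values can be sketched as below. This is an illustration with a hypothetical function name; the start and end points are supplied by the caller, as they are by the operator in the text.

```python
def background_candidates(start, end, count=10):
    """Return count equally spaced values from start to end, both included."""
    step = (end - start) / (count - 1)
    return [start + i * step for i in range(count)]
```

With a start point of 0 and the median of the data as the end point, the default `count=10` yields the start point, eight intermediate values, and the end point.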
[0043]
Next, a certain background candidate value is subtracted from each extracted original data value (step 402), and the conversion / standardization processing unit 36 logarithmically transforms the original data values from which the background candidate value has been subtracted (step 403). The logarithmically transformed data acquired here are also stored in the data buffer 30 for use in later processing. Steps 402 and 403 are performed for all the selected background candidate values (for example, ten).
[0044]
Next, the logarithmically transformed data values (transformed values) for a given background candidate value are compared with the corresponding standard values, which are calculated by the following method and stored in the data buffer 30, and an index representing the difference between them is calculated (step 404). In the present embodiment, the standard values are obtained as follows.
Since the quantiles have a range, the following numerical values are calculated to correct the statistical median.
m (i) = (i−0.3175) / (n + 0.365)
Where n is the number of data and i is a natural number from 1 to n.
[0045]
Next, the inverse function F^-1 (R) of the normal distribution function is applied to each of the determined m (i). Each of the obtained values becomes the standard value corresponding to the respective data value.
Next, the difference calculation / comparison processing unit 38 calculates, for each background candidate value, for example the sum of the absolute values of the differences (the differences between the data values and the standard values) or the sum of the squares of the differences. The value obtained here serves as the difference index of that background candidate value. Of course, “r” of the least-squares method may also be used as the difference index. In practice, using “r” of the least-squares method is desirable from the viewpoint of obtaining a highly accurate background value.
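The standard values and an r-based difference index can be sketched as follows. Several points are assumptions made for illustration: the logarithm is taken to base 10, "r" is interpreted as the Pearson correlation coefficient between the standard values and the logarithmic values, and the index is defined as 1 − r so that a smaller value means a better fit. The function names are hypothetical; `statistics.NormalDist().inv_cdf` is the Python standard-library inverse of the normal distribution function.

```python
import math
from statistics import NormalDist


def standard_values(n):
    """Apply the inverse normal distribution function to m(i) for i = 1..n."""
    nd = NormalDist()
    return [nd.inv_cdf((i - 0.3175) / (n + 0.365)) for i in range(1, n + 1)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def difference_index(sorted_values, gamma):
    """Return 1 - r for one background candidate gamma (assumed definition).

    sorted_values must be in ascending order and all larger than gamma.
    """
    logs = [math.log10(x - gamma) for x in sorted_values]
    return 1.0 - pearson_r(standard_values(len(logs)), logs)
```

Because r does not depend on the scale or offset of the logarithmic values, this variant of the index can be evaluated before μ and δ are known. The sum of absolute or squared differences mentioned in the text would instead require the transformed values to be standardized first.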
[0046]
Next, for example, the difference calculation / comparison processing unit 38 generates a graph with the background candidate value as the horizontal axis and the difference index as the vertical axis, and displays this on the screen of the display device 16 (step 405).
[0047]
The operator refers to the graph displayed on the screen of the display device 16 and selects a desired range of background candidate values or a background value (step 406). If the selected value is considered sufficiently satisfactory as the background value (Yes in step 407), the process ends. If it is not satisfactory, a predetermined number of new background candidate values are determined from the newly selected, narrower range of background candidate values (step 408), and the processes of steps 402 to 407 are repeated. The new background candidate values may likewise be values that divide the range between the start point and the end point at equal intervals, or values that divide it at equal ratios. The background value finally obtained is stored in the result storage unit 42.
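The repeated narrowing can be sketched as an automatic loop. This is an assumption-laden illustration: in the text the operator selects the new range from the displayed graph, whereas here the new range is simply taken as the interval between the two neighbours of the best candidate. The function name, the `index_fn` callable, and the fixed number of rounds are all hypothetical.

```python
def narrow_background(index_fn, start, end, count=10, rounds=6):
    """Repeatedly evaluate equally spaced candidates and shrink the range.

    index_fn(candidate) returns the difference index (smaller is better).
    The automatic range selection replaces the operator's choice (assumption).
    """
    best_value = start
    for _ in range(rounds):
        step = (end - start) / (count - 1)
        candidates = [start + i * step for i in range(count)]
        scores = [index_fn(c) for c in candidates]
        best = scores.index(min(scores))
        best_value = candidates[best]
        # New range: the neighbours on either side of the best candidate.
        start = candidates[max(best - 1, 0)]
        end = candidates[min(best + 1, count - 1)]
    return best_value
```

Each round reduces the width of the candidate range to at most two candidate steps, so a handful of rounds is enough to locate the minimum of a well-behaved difference index quite closely.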
[0048]
For example, as shown in FIG. 12, a graph is generated with the range of background candidate values on the horizontal axis and the difference index on the vertical axis. In the example of FIG. 12, values in increments of 100 from 1800 to 2700 (1800, 1900, 2000, ..., 2700) are adopted as background candidate values. The operator can refer to this graph to narrow down the range of background candidates and obtain the difference index again for the candidate values in the new range (see FIG. 13). In the example of FIG. 13, it can be seen that “2363” is the most appropriate background value at this point.
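The repeated narrowing of the candidate range can be sketched as a simple grid refinement. In the text the operator chooses the new range from the displayed graph; the automatic rule below (keep one grid step on either side of the best candidate) is an assumption made only for illustration, and the function name, the number of rounds, and the toy index function are hypothetical.

```python
def narrow_background(index_fn, lo, hi, n_candidates=10, rounds=6):
    """Evaluate equally spaced candidates and shrink the range around the best one."""
    best = lo
    for _ in range(rounds):
        step = (hi - lo) / (n_candidates - 1)
        candidates = [lo + k * step for k in range(n_candidates)]
        best = min(candidates, key=index_fn)   # smallest difference index wins
        lo, hi = best - step, best + step      # assumed narrowing rule
    return best

# Toy difference index with its minimum at 2363, searched from 1800 to 2700
# in increments of 100 as in the FIG. 12 example.
estimate = narrow_background(lambda g: (g - 2363) ** 2, 1800, 2700)
```

In practice the toy index would be replaced by the difference index computed from the sampled data values.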
[0049]
Next, the processing for calculating the remaining parameters will be described. In general, for a lognormal distribution, the average of the log data is used as μ (the characteristic value of the central tendency) and the standard deviation as σ (the characteristic value of the fluctuation). However, in data obtained from a DNA chip, a strong signal (a relatively large data value) is accurate, whereas a weak signal (a relatively small data value) contains relatively large noise. Data that are buried in noise and take a negative value cannot be converted to a logarithmic value, so many of these weak signals are discarded. In such a case, the above calculation method cannot be used.
[0050]
Usually, the characteristic value of the central tendency is obtained as an average value. However, the average is not a so-called robust estimator, and it is overestimated particularly when weak signals are selectively lost. In such cases, the median is known to be more effective.
Likewise, the characteristic value of the fluctuation is usually expressed as a standard deviation. However, the standard deviation is not robust either, and it is underestimated when weak signals are selectively dropped as described above. As a robust alternative, iqr, which obtains a characteristic value of the variation from the interquartile range, is known (for example, http://infoshako.sk.tsukuba.ac.jp/InfoRes/jdoc/MATLAB5/jhelp/Toolbox/stats/iqr.html).
[0051]
However, the median is obtained from one point in the data group and iqr from two points, so their accuracy is limited. The problem becomes serious particularly when the data are acquired from a small number of spots or when the number of data available for correction is limited. Therefore, the present embodiment employs the following method, which calculates the parameters with high accuracy even when the number of data is relatively limited.
[0052]
FIG. 5 is a flowchart showing the parameter calculation processing according to the present embodiment. As shown in FIG. 5, first, the ideal values and the actual measurement values from which the background value has been subtracted are acquired (step 501). The ideal value is the same as the standard value obtained in the previous step 404. Next, a graph with the ideal value (theoretical value) on the horizontal axis and the data value based on the actual measurement value on the vertical axis is created and displayed on the screen of the display device (step 502). If the actual measurement values were exactly lognormally distributed, this graph would almost coincide with y = x. In practice, however, as shown in FIG. 14, the graph obtained by plotting the actual measurement values has a slope other than 1 (= a; a≈0.56 in FIG. 14) and a y-intercept (= b; b≈2.80 in FIG. 14), and linearity is lost in the portion where the value of x is relatively small.
[0053]
However, even the graph of FIG. 14 contains a portion that can be regarded as substantially straight (for example, the portion where x is positive). Therefore, in this embodiment, the user refers to the graph and operates the input device to specify a range judged to be linear (step 503), and the measured values in the specified range are acquired. Using these values, a linear expression relating the measured values to the theoretical values is obtained by, for example, the least-squares method. The slope “a” of the obtained linear expression “ax + b” corresponds to the characteristic value “σ” of the fluctuation, and the y-intercept “b” corresponds to the characteristic value “μ” of the central tendency (step 504).
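The fit over the user-specified range can be sketched with an ordinary least-squares line. The function name and the synthetic data are hypothetical; the slope and intercept used to build the data (0.56 and 2.80) are borrowed from the FIG. 14 example purely for illustration.

```python
def fit_sigma_mu(ideal, log_values, x_min, x_max):
    """Least-squares line through the points whose ideal value lies in [x_min, x_max].

    Returns (sigma, mu): the slope and the y-intercept of the fitted line.
    """
    pts = [(x, y) for x, y in zip(ideal, log_values) if x_min <= x <= x_max]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    s_xx = sum((x - mean_x) ** 2 for x, _ in pts)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x

# Synthetic plot: straight for x >= 0, flattened for x < 0 (as weak signals are).
ideal = [-2.0 + 0.25 * k for k in range(17)]
logs = [0.56 * x + 2.80 if x >= 0 else 0.20 * x + 2.80 for x in ideal]
sigma, mu = fit_sigma_mu(ideal, logs, x_min=0.0, x_max=2.0)
```

Because only the straight portion enters the fit, the distorted low end of the plot does not pull the estimates of σ and μ.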
[0054]
For example, the image formation processing unit 40 of the analysis apparatus 10 may use the obtained “a” and “b” to generate, on the screen of the display device 16, a graph with the ideal value on the horizontal axis and the value z = (log (x−γ) −μ) / σ on the vertical axis. FIG. 15 is an example of a graph in which the values plotted in FIG. 14 are plotted again after subtracting μ and dividing by σ. If the user refers to the displayed graph and is not satisfied (No in step 505), the process returns to the range designation in the original graph, and the processing from step 503 onward is executed again.
[0055]
On the other hand, if the result is satisfactory (Yes in step 505), the previously obtained background value is stored as “γ”, the intercept as “μ”, and the slope as “σ” in the result storage unit 42 in association with information identifying the DNA chip. Using the parameters thus obtained, each data value x obtained from the DNA chip can be standardized by the equation
z = (log (x−γ) −μ) / σ
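The standardization equation can be written as a small function. This is a sketch under assumptions: the function name is hypothetical, a base-10 logarithm is assumed, and returning None for values at or below the background is one possible way to mark data that cannot be log-transformed. The parameter values in the example are illustrative only.

```python
import math

def standardize(x, gamma, mu, sigma):
    """z = (log10(x - gamma) - mu) / sigma, or None when x is at or below gamma."""
    if x <= gamma:
        return None  # cannot be converted to a logarithmic value
    return (math.log10(x - gamma) - mu) / sigma

gamma, mu, sigma = 2363, 2.80, 0.56
z_center = standardize(gamma + 10 ** mu, gamma, mu, sigma)           # about 0
z_plus_one = standardize(gamma + 10 ** (mu + sigma), gamma, mu, sigma)  # about 1
z_missing = standardize(2000, gamma, mu, sigma)                       # None
```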
[0056]
Thus, according to the present embodiment, an appropriate background value is calculated to eliminate the influence of noise, and the characteristic value of the central tendency and the characteristic value of the fluctuation used for standardization are obtained from the straight-line portion of the graph in which the measured values are plotted. As a result, more robust standardization can be realized.
[0057]
Next, the initial correction process (step 302) according to the present embodiment will be described in more detail. In the present embodiment, two types of correction can be performed according to the characteristics of data from the DNA chip.
The DNA chip is formed by a method such as stamping DNA onto a surface such as glass. At this time, the data values may appear “strong” or “weak” with a certain regularity owing to the accuracy of the stamping machine (arrayer or spotter).
Such a tendency may occur for each pin of the array, for each row of the spotted grid, or for each grid column / row of the microtiter plate holding the DNA sample.
[0058]
For example, when there is a feature in the strength of data for each horizontal row of the grid, it is conceivable to standardize the data in units of horizontal rows. However, in this case, the number n of data constituting one data set is reduced (for example, 32). When the background value is predicted from such a small number of data, and the characteristic value of the central tendency and the characteristic value of the fluctuation are calculated, the accuracy is remarkably lowered. It is known that the standard deviation of the average value of random numbers is proportional to the reciprocal of the square root of n. This indicates that it is difficult to accurately predict the characteristic value of the central tendency from a small number of data.
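As a brief worked illustration of the statement above, for n independent values with a common standard deviation, the standard deviation of their average shrinks with the square root of n. The symbols s and \bar{x}, and the comparison figure of 10000 data, are introduced here for illustration only and do not appear in the original.

```latex
\mathrm{SD}(\bar{x}) = \frac{s}{\sqrt{n}}, \qquad
n = 32:\ \frac{s}{\sqrt{32}} \approx 0.177\,s, \qquad
n = 10000:\ \frac{s}{\sqrt{10000}} = 0.01\,s
```

Under these assumptions, an average taken over a single row of 32 spots is roughly 18 times less precise than one taken over 10000 spots.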
[0059]
Therefore, in the initial correction process, the moving averages of the horizontal rows and vertical columns of the DNA chip are calculated, and if each column or row has a characteristic, the values are corrected for each column or row (first preprocessing: see FIG. 6). In other cases too, if the transition of the values from spot to spot has periodicity, the data are corrected in consideration of that periodicity (second correction process: see FIG. 7).
[0060]
Hereinafter, the description is given for horizontal rows, but it goes without saying that the same processing can be executed for vertical columns. First, the data are arranged in the order in which the spots were actually spotted when the DNA chip was produced by a spotter. Within this data group, the average of the data values of a given column on the DNA chip and of a predetermined number of columns before and after it (for example, two columns before and after) is calculated (steps 601 and 602). The calculation of the average value is repeated until the last column (see steps 603 and 604), and it is then determined whether the average value of each column has a characteristic (step 605). FIG. 16 is a graph showing the logarithmic value of each spot and the moving average of the logarithmic values for data acquired from a certain DNA chip. In the example shown in FIG. 16, the DNA chip has 32 spots in each horizontal row. Because the average is taken together with the data values of a predetermined number of preceding and following columns, the averages are almost the same if the original data values are random. In FIG. 16, no tendency can be seen in the graph of the logarithmic value of each spot, indicated by the solid line, but the average of the logarithmic values of the data corresponding to the 32 spots of each column, indicated by the broken line, varies widely. In such a case, it is determined that each column of the DNA chip has a characteristic (Yes in step 605), and the first preprocessing is performed on the data values.
[0061]
In step 605, a test may be performed as to whether or not the variation of the moving average value is significant.
In the first preprocessing, the median of the data values corresponding to the spots of a column of the DNA chip is obtained (step 607), and each data value corresponding to a spot of that column is divided by the median. This is performed for each column (see steps 609 and 610).
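The moving average over neighbouring columns and the division by the column median can be sketched as follows. The function names and the tiny example chip are hypothetical, and the truncation of the window at the edges of the chip is an assumption, since the text does not say how the first and last columns are handled.

```python
from statistics import mean, median

def moving_column_means(columns, half_window=2):
    """Mean of all data values in each column together with its neighbouring columns."""
    means = []
    for i in range(len(columns)):
        lo = max(0, i - half_window)                  # window is cut at the chip edge
        hi = min(len(columns), i + half_window + 1)
        window = [v for col in columns[lo:hi] for v in col]
        means.append(mean(window))
    return means

def divide_by_column_median(columns):
    """First preprocessing: divide every data value by the median of its column."""
    return [[v / median(col) for v in col] for col in columns]

# Three columns, the middle one systematically about ten times stronger.
chip = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [2.0, 4.0, 6.0]]
trend = moving_column_means(chip, half_window=0)   # per-column means
corrected = divide_by_column_median(chip)
```

After the division every column has a median of 1, so a column-wise bias of the kind visible in the moving average is removed while the ratios within a column are kept.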
[0062]
Next, the second preprocessing will be described. Here, correction is performed in consideration of whether the data values corresponding to the spots oscillate. First, the data values arranged in the order of the spots are acquired (step 701), and FFT (Fast Fourier Transform) processing is executed on the data group (step 702). If the FFT reveals a periodic component (signal component), the value of that component at the corresponding phase is subtracted from each data value in consideration of the period (steps 703 and 704). The operator may repeat the processing of steps 703 and 704 until a satisfactory result is obtained. Data that have undergone the first or second correction process are stored in the data buffer 30 and are then subjected to the data sorting that follows (see step 303 in FIG. 3).
As described above, according to the initial correction process according to the present embodiment, it is possible to eliminate regularity in spot production.
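The removal of a periodic component can be sketched as below. Note the substitution: to stay self-contained, this example uses a plain discrete Fourier transform written out with the standard library rather than an FFT routine; the result is the same, only slower. The function names are hypothetical, and picking the single strongest non-constant frequency automatically is an assumption, since in the text the operator may judge the result and repeat the step.

```python
import cmath
import math

def dft(values):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(values))
            for k in range(n)]

def remove_dominant_period(values):
    """Subtract the strongest non-constant sinusoidal component from each value."""
    n = len(values)
    spectrum = dft(values)
    k = max(range(1, n // 2 + 1), key=lambda j: abs(spectrum[j]))
    factor = 1 if 2 * k == n else 2   # bin k and its mirror bin n - k together
    corrected = []
    for t, v in enumerate(values):
        component = factor * (spectrum[k] * cmath.exp(2j * math.pi * k * t / n)).real / n
        corrected.append(v - component)   # value of the component at this phase
    return k, corrected

# A constant level of 5 with an oscillation repeating 4 times over 32 spots.
spots = [5 + 2 * math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
freq, flat = remove_dominant_period(spots)
```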
[0063]
Next, the second embodiment of the present invention will be described. In the second embodiment, appropriate parameters are calculated while periodicity is eliminated. FIGS. 8 and 9 are flowcharts showing an outline of the processing according to the second embodiment. Also in the second embodiment, as in the initial correction described with reference to FIG. 6, the data are arranged in advance in the order of spotting at the time the DNA chip was prepared by a spotter. Further, the processing is not limited to horizontal rows; the same processing can be executed for vertical columns, as in the example of FIG. 6.
[0064]
In this process, data of a predetermined column is acquired (step 801), and the characteristic value of the central tendency of the column is calculated from the data value of the column (step 802). Here, the median value may be used, or it may be obtained from the average value of the logarithmic values of the remaining data values from which the upper limit and the lower limit are removed. Next, the background value of the column is set (step 803). The background value to be set is considered to be proportional to the characteristic value of the central tendency obtained in step 802. That is, the background value is considered to be αMi with respect to a characteristic value Mi (i is a column number) of a central tendency of a certain column.
[0065]
Next, the background value is subtracted from each data value and the result is converted to a logarithm (steps 804 and 805). A data value equal to or lower than the background value cannot be converted into a logarithmic value; it is desirable to display such data on the screen of the display device as being below the measurement limit. Thereafter, the characteristic value Mi of the central tendency, or the value obtained by subtracting the background value from the characteristic value of the central tendency, is subtracted from the logarithmic value (step 806). Further, a fluctuation characteristic value (second characteristic value) is set, and the subtracted value is divided by the second characteristic value (step 807). As the fluctuation characteristic value (second characteristic value) σ, it is desirable to set a value such that, in a graph plotting the corresponding standard values on the x-axis and the sorted divided values on the y-axis, a certain range (for example, the range from the top 60% to 90%) best approximates y = x.
[0066]
That is, in steps 801 to 808, for a given column i,
(log (X−αMi) −Mi) / σ
is calculated. Such processing is executed for each column (steps 809 and 810). These data values are temporarily stored in the data buffer 30.
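The per-column calculation can be sketched as follows. Several points are assumptions rather than statements of the original: the function name is hypothetical, a base-10 logarithm is used, Mi is taken as the column median, and the value subtracted from the logarithm is taken as the logarithm of (Mi minus the background), which is one reading of the second option described for step 806. The values of α and σ are illustrative only.

```python
import math
from statistics import median

def standardize_column(column, alpha, sigma):
    """Standardize one column with a background proportional to its median."""
    m_i = median(column)                    # characteristic value of the central tendency
    background = alpha * m_i                # background assumed to be alpha * Mi
    center = math.log10(m_i - background)   # assumed reading of the value subtracted in step 806
    result = []
    for x in column:
        if x <= background:
            result.append(None)             # below the measurement limit
        else:
            result.append((math.log10(x - background) - center) / sigma)
    return result

z_values = standardize_column([1, 100, 200, 400, 800], alpha=0.1, sigma=0.5)
```

With this reading the median spot of every column maps to zero, so columns with different overall strength end up on a common scale.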
[0067]
Thereafter, the temporarily stored data values are sorted and compared with the corresponding standard values (steps 901 and 902). Here again, a graph is generated with the corresponding standard values on the x-axis and the sorted data values on the y-axis, and it is determined whether the plotted points approximate y = x (step 903). If the approximation is sufficient (Yes in step 903), the background value (αMi), the central tendency characteristic value (Mi), and the fluctuation characteristic value (σ) of each column are stored in the result storage unit 42 (step 904). Whether the approximation is sufficient may be determined from the sum of the squares of the differences (squared error) between the corresponding standard values and the data values, or from the sum of the absolute values of the differences.
[0068]
When the difference exceeds a predetermined range (that is, when the line connecting the plotted points deviates from y = x by a predetermined amount or more) (No in step 903), the proportionality constant α is changed, the fluctuation characteristic value σ is changed accordingly, and the processing from step 801 is executed again. According to this embodiment, the background value for the i-th column of the DNA chip is αMi and the characteristic value of the central tendency is Mi, so values specific to each column are obtained. This makes it possible to eliminate unevenness arising in chip manufacturing.
[0069]
Next, the third embodiment of the present invention will be described. In the first embodiment, the background value (γ), the central tendency characteristic value (μ), and the fluctuation characteristic value (σ) are calculated based on the data values (actual measurement values) actually obtained from the DNA chip. However, there may be cases where the noise relative to the median cannot be ignored. For example, the noise level of the entire hybridization may become high because of a poorly performed wet experiment. If the noise level is close to the median, it is difficult to apply the robust method according to the first embodiment. Here, noise refers to a component arising by chance that is included in individual data, and is considered to be caused by measurement error, error in the spotted amount, and the like. Noise is the counterpart of signal, and the raw data obtained from a DNA chip can be regarded as the sum of noise and signal. The background can be defined as the portion of the signal of each datum that is not derived from RNA in the sample. Therefore, the signal can be regarded as the sum of the RNA-derived portion and the background.
[0070]
As described above, even if the noise level is high, the upper data are sufficiently larger than the noise level owing to the nature of the lognormal distribution. These data should be analyzable if an appropriate characteristic value of the original central tendency can be found. If the background could be obtained, the characteristic value of the central tendency could be found by trial and error. However, the relationship between the background and the characteristic value of the central tendency is unknown. Of the three parameters of the lognormal distribution introduced in the present invention, it is difficult to find two by the above method because of the amount of calculation involved and because the solution cannot be uniquely determined.
[0071]
Therefore, provided that a lognormal distribution can be expected from the combination of chip and sample in the absence of noise, the values can be obtained by the following method. FIG. 10 is a flowchart illustrating the processing according to the third embodiment. In the third embodiment, as shown in FIG. 10, the original data relating to the DNA chip are acquired from the data buffer 30 (step 1001) and sorted in ascending or descending order (step 1002). The sorted data are also stored in the data buffer 30. Next, an ideal value Zi (i = 1, 2, ...) of the lognormal distribution is assigned to each of the sorted data values (step 1003). This ideal value Zi can be obtained by a method substantially the same as the calculation of the standard value in the first embodiment (see step 403). To explain briefly again, first, m (i) shown below is calculated.
m (i) = (i−0.3175) / (n + 0.365)
Where n is the number of data and i is a natural number from 1 to n.
[0072]
Next, the inverse function F^{-1}(R) of the normal distribution function is applied to each of the determined m (i). Each of the obtained values becomes the Zi corresponding to the data value. These ideal values are also stored in the data buffer 30 for use in later processing. Once the ideal values Zi have been obtained, each Zi is multiplied by the expected characteristic value (s) of the fluctuation (step 1004). Note that the characteristic value of the fluctuation can be considered not to vary much from experiment to experiment, and can be predicted to some extent.
[0073]
Then, a graph is generated with 10 raised to the power of the value obtained by the multiplication (that is, 10^{(s * Zi)}) on the x-axis and the measured value xi on the y-axis (step 1005). In this graph, the straight-line portion can be regarded as a reliable region (confidence region). Therefore, for example, when the user refers to the displayed graph and selects the straight-line portion (specifies the range) (step 1006), the intercept and slope of the graph are calculated (step 1007). The logarithm of the obtained slope is stored as the characteristic value (u) of the central tendency, and the intercept is stored as the background value (g).
[0074]
The characteristic value of the central tendency thus obtained and the effectiveness of the background will be briefly described below. In standardization by three parameters according to the present invention (Z standardization), Zi can be expressed by the following equation.
Zi = {log (xi-g) -u} / s
Zi is an ideal value, xi is a corresponding actually measured value, and g, u, and s are a background value, a characteristic value of a central tendency, and a characteristic value of variation, respectively.
Solving the above equation for xi gives
xi = (10^u) * (10^{(s * Zi)}) + g
If the values are plotted with 10^{(s * Zi)} on the x-axis and xi on the y-axis, a line that is straight over a certain range is obtained. Since 10^u is the slope of this straight line and g is its y-intercept, the characteristic value u of the central tendency can be obtained by taking the logarithm of the slope. The background value g, the central tendency characteristic value u, and the fluctuation characteristic value s acquired as described above are stored in the result storage unit 42 (step 1008).
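The estimation of g and u from the straight-line portion can be sketched as follows. The function names and the synthetic data are hypothetical, the lower bound on Zi stands in for the range that the user selects on the graph, and a base-10 logarithm is assumed in line with the powers of 10 used in the equation.

```python
import math
from statistics import NormalDist

def ideal_values(n):
    """Zi for ranks i = 1..n, from m(i) = (i - 0.3175) / (n + 0.365)."""
    normal = NormalDist()
    return [normal.inv_cdf((i - 0.3175) / (n + 0.365)) for i in range(1, n + 1)]

def fit_background_and_center(measured, s, z_lower):
    """Fit xi = 10^u * 10^(s*Zi) + g over the points with Zi >= z_lower; return (g, u)."""
    z = ideal_values(len(measured))
    pts = [(10 ** (s * zi), xi) for zi, xi in zip(z, sorted(measured)) if zi >= z_lower]
    n = len(pts)
    mean_x = sum(px for px, _ in pts) / n
    mean_y = sum(py for _, py in pts) / n
    slope = (sum((px - mean_x) * (py - mean_y) for px, py in pts)
             / sum((px - mean_x) ** 2 for px, _ in pts))
    intercept = mean_y - slope * mean_x
    return intercept, math.log10(slope)   # intercept is g, log of slope is u

# Synthetic measurements built with g = 300, u = 2.5 and s = 0.78.
zs = ideal_values(100)
data = [300 + 10 ** 2.5 * 10 ** (0.78 * zi) for zi in zs]
g_est, u_est = fit_background_and_center(data, s=0.78, z_lower=0.7)
```

The appeal of this formulation is that, once s is fixed, the relation is linear in the plotted variable, so the two remaining parameters come out of a single straight-line fit.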
[0075]
In the third embodiment, as described above, the data being processed have a noise level so high that analysis by the robust method is difficult. Therefore, the usable range of data values (its lower limit) is calculated as follows. It suffices to find the range (or lower limit) over which linearity is maintained in the graph obtained in step 1005, in which 10^{(s * Zi)} is on the x-axis and xi is on the y-axis (step 1009). The lower limit determined in this way is also stored in the result storage unit 42. FIG. 17 shows an example of a graph in which the values are plotted with 10^{(s * Zi)} on the x-axis and xi on the y-axis. In FIG. 17, the data of one chip spotted with twelve pins are shown as twelve groups of data values, each forming one graph. Here, linearity is lost at a value of 10^{(s * Zi)} of about 3.5. In this example, since s≈0.78, the lower limit of Zi was found to be about 0.7.
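The arithmetic behind the lower limit in this example can be checked directly: the x-axis value at which linearity is lost is converted back to Zi by taking the base-10 logarithm and dividing by s. The variable names are hypothetical.

```python
import math

s = 0.78        # expected characteristic value of the fluctuation in the example
x_break = 3.5   # value of 10^(s * Zi) at which linearity is lost

# Invert x = 10^(s * Zi) to recover the corresponding ideal value.
z_lower = math.log10(x_break) / s
```

The result is close to the value of about 0.7 quoted in the text.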
[0076]
Next, those whose ideal value assigned to the data value is within the range (that is, not less than the lower limit) are taken out. Those not within the range are desirably displayed on the screen of the display device as being below the measurement limit. On the other hand, the retrieved ideal value is a standardized data value (step 1010).
[0077]
According to the third embodiment, even if the noise level is high and the method according to the first embodiment cannot be applied, the data can be standardized on the assumption that they follow a lognormal distribution. It is also possible to specify the lower limit of the values usable as data values.
Needless to say, the present invention is not limited to the above embodiments; various modifications can be made within the scope of the invention described in the claims, and these are also included in the scope of the present invention.
[0078]
For example, the initial correction process is not limited to that described above. FIG. 11 is a flowchart illustrating another example of the initial correction process. The example shown in FIG. 11 is also used to eliminate the tendency of the data for each column or row. Here, for each column, the background value is determined based on the characteristic value of the central tendency (see steps 1101 to 1103), and the subtraction value obtained by subtracting the set background value from the data value is converted to a logarithm (step 1104). Next, the characteristic value of the central tendency is subtracted from the logarithmic value (step 1105). Here too, as the characteristic value of the central tendency, the median of the data values of each column may be used, or the average of the data values remaining after the upper and lower extremes are removed may be used. Further, it is desirable to use, as the background value, a value obtained by multiplying the characteristic value by a proportionality constant. By executing such processing up to the last column (see steps 1106 and 1107), the unevenness arising in chip manufacturing can be regarded as eliminated.
[0079]
In the above embodiments, data obtained from a DNA chip are processed to obtain data on which analysis such as comparison can be performed. However, the present invention is not limited to DNA chips and is also applicable to so-called protein chips. That is, the present invention can also be applied to data obtained by labeling the crude protein in a protein chip sample and applying it to an antibody chip.
Furthermore, the present invention is not limited to DNA chips or protein chips. It can be applied in the same way to data representing gene expression levels obtained by any method, such as data obtained from DNA or other genes immobilized on microbeads.
[0080]
In addition, as the DNA chip providing the data subjected to the data processing method according to the present invention, it is desirable to use one in which the spot positions of the cDNA clones are independent of the origin of the clones and of the strength of their expression. Also, when clones from a single tissue are spotted, or when only a limited number of clones are spotted, it is desirable to spot a plurality of randomly selected clones as a control for measuring the characteristic value of the central tendency (or the characteristic value of the variation) of the data.
[0081]
[Effect of the invention]
According to the present invention, it is possible to provide a data processing method for enabling more accurate analysis of data obtained from a DNA chip.
[Brief description of the drawings]
FIG. 1 is a hardware configuration diagram of an analysis apparatus according to a first embodiment of the present invention.
FIG. 2 is a functional block diagram of the main part of the analyzing apparatus according to the first embodiment;
FIG. 3 is a flowchart illustrating an outline of processing performed by the analysis apparatus according to the present embodiment;
FIG. 4 is a flowchart illustrating in more detail a background value calculation process according to the first embodiment;
FIG. 5 is a flowchart illustrating a parameter calculation process according to the present embodiment.
FIG. 6 is a flowchart illustrating an example of an initial correction process according to the present embodiment.
FIG. 7 is a flowchart illustrating an example of an initial correction process according to the present embodiment.
FIG. 8 is a flowchart illustrating an outline of processing according to the second embodiment;
FIG. 9 is a flowchart illustrating an outline of processing according to the second embodiment;
FIG. 10 is a flowchart schematically illustrating a process executed by an analysis apparatus according to a third embodiment.
FIG. 11 is a flowchart showing another example of the initial correction process according to the present invention.
FIG. 12 is a graph illustrating an example of an index of difference for each background candidate value.
FIG. 13 is a graph illustrating an example of an index of difference for each background candidate value.
FIG. 14 is an example of a graph in which values are plotted with ideal values (theoretical values) on the horizontal axis and data values based on measured values on the vertical axis.
FIG. 15 is another example of a graph in which values are plotted with ideal values (theoretical values) on the horizontal axis and data values based on measured values on the vertical axis.
FIG. 16 is a graph showing a data value and a moving average value for each spot of data acquired from a certain DNA chip.
FIG. 17 is a diagram showing an example of a graph in which values are plotted with 10^{(s * Zi)} on the x-axis and xi on the y-axis.
[Explanation of symbols]
10 Analysis device
30 Data buffer
32 Background candidate calculator
34 Pre-processing section
36 Conversion / standardization processing part
38 Difference calculation / comparison processing unit
40 Image formation processing unit
42 Result storage unit
44 Data correction part
46 Sort / Extract Processing Unit

Claims

A method of processing gene expression data, in which array data obtained based on gene expression levels are processed to acquire analyzable data, the method comprising:
a step of acquiring the array data, sorting the data values of the acquired array data, extracting a predetermined number of data values at predetermined intervals from the sorted data values, and temporarily storing them in storage means;
a step of selecting a plurality of background candidates and temporarily storing them in the storage means;
a step of subtracting the value of each background candidate from each of the extracted data values to obtain subtraction values, logarithmically transforming each subtraction value to obtain logarithmic values, and temporarily storing the logarithmic values in the storage means;
a step of calculating a standard value of a normal distribution corresponding to each of the logarithmic values;
a step of calculating, for each of the background candidates, an index indicating the difference between each logarithmic value and the standard value;
a step of narrowing the range of values of the background candidates based on the index;
a step of determining a background value by repeating the acquisition of the subtraction values and logarithmic values, the calculation of the index indicating the difference, and the narrowing of the background candidate values; and
a step of standardizing each of the logarithmic values temporarily stored in relation to the determined background value, and storing each of the standardized values in the storage means.

A method of processing gene expression data, in which array data obtained based on gene expression levels are processed to acquire analyzable data, the method comprising:
a step of acquiring the array data, sorting the data values of the acquired array data, extracting a predetermined number of data values at predetermined intervals from the sorted data values, and temporarily storing them in storage means;
a step of determining a background value γ and storing it in the storage means;
a step of logarithmically transforming the subtraction value, which is the data value from which the background value has been subtracted, to obtain a logarithmic value, and temporarily storing it in the storage means;
a step of calculating, with reference to the logarithmic values, a characteristic value μ of the central tendency and a characteristic value σ of the fluctuation, and storing them in the storage means; and
a step of calculating, for each data value x, z = (log (x−γ) −μ) / σ as a standard value z, and storing each calculated standard value z in the storage means.

The method according to claim 2, wherein the step of determining the background value γ comprises:
selecting a plurality of background candidates and temporarily storing them in the storage means;
subtracting the value of each background candidate from each of the extracted data values to obtain subtraction values, logarithmically transforming each subtraction value to obtain logarithmic values, and temporarily storing the logarithmic values in the storage means;
calculating a standard value of a normal distribution corresponding to each of the logarithmic values;
calculating, for each background candidate, an index indicating the difference between the logarithmic values and the standard values; and
narrowing the range of background candidate values based on the index,
wherein the background value is determined by repeating the acquisition of the subtraction values and logarithmic values, the calculation of the index indicating the difference, and the narrowing of the background candidate values.
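The iterative narrowing described in this claim can be sketched as a repeated grid search. This is only an illustration under stated assumptions: the normal scores are taken from the rank positions (i + 0.5) / n, the difference index is 1 − r² of the normal quantile plot, candidates are restricted to values below the smallest sampled data value, and all function names are hypothetical. None of these choices is specified in the claim.

```python
import math
from statistics import NormalDist


def normal_scores(n):
    """Standard normal quantiles for ranks 0..n-1 (assumed plotting position)."""
    nd = NormalDist()
    return [nd.inv_cdf((i + 0.5) / n) for i in range(n)]


def sample_sorted(values, count):
    """Sort the data and keep roughly `count` values at a fixed interval."""
    s = sorted(values)
    step = max(1, len(s) // count)
    return s[::step]


def misfit(sample, scores, c):
    """Assumed difference index: 1 - r^2 between log10(x - c) and the scores."""
    if sample[0] <= c:
        return math.inf  # the logarithm is undefined for this candidate
    logs = [math.log10(x - c) for x in sample]
    n = len(logs)
    ml, ms = sum(logs) / n, sum(scores) / n
    sxy = sum((l - ml) * (z - ms) for l, z in zip(logs, scores))
    sxx = sum((z - ms) ** 2 for z in scores)
    syy = sum((l - ml) ** 2 for l in logs)
    return 1.0 - sxy * sxy / (sxx * syy)


def find_background(values, count=200, n_candidates=11, rounds=12):
    """Repeatedly score a grid of candidates and shrink the grid around the best."""
    sample = sample_sorted(values, count)
    scores = normal_scores(len(sample))
    lo, hi = 0.0, sample[0]
    for _ in range(rounds):
        step = (hi - lo) / (n_candidates - 1)
        cands = [lo + k * step for k in range(n_candidates)]
        best = min(range(n_candidates),
                   key=lambda k: misfit(sample, scores, cands[k]))
        lo = cands[max(best - 1, 0)]
        hi = cands[min(best + 1, n_candidates - 1)]
    return (lo + hi) / 2
```

Keeping the neighbours of the best candidate as the new range means the bracket still contains the minimum as long as the index has a single minimum over the search range, which this sketch assumes rather than proves.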

The method according to claim 2 or 3, wherein the step of obtaining the characteristic value μ of central tendency and the characteristic value σ of variation comprises:
calculating a standard value corresponding to each of the logarithmic values;
comparing the logarithmic values with the standard values to determine a range in which the ratio between the two remains substantially constant;
calculating the slope and y-intercept of the straight line formed in that range when the standard values are taken on the x-axis and the logarithmic values on the y-axis; and
determining the calculated y-intercept as the characteristic value μ of central tendency and the slope as the characteristic value σ of variation.
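Reading μ and σ off a straight line can be sketched as a least-squares fit on the normal quantile plot. The claim determines the linear range from the data; this sketch replaces that step with a fixed central window of standard values, which is an assumption, as are the rank-based normal scores and the name `fit_mu_sigma`.

```python
from statistics import NormalDist


def fit_mu_sigma(log_values, z_window=(-1.0, 1.0)):
    """Fit log value = mu + sigma * z over an assumed central window of z.

    Returns (mu, sigma) as the y-intercept and slope of the fitted line.
    """
    ys = sorted(log_values)
    n = len(ys)
    nd = NormalDist()
    zs = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]  # assumed plotting position
    pts = [(z, y) for z, y in zip(zs, ys) if z_window[0] <= z <= z_window[1]]
    mz = sum(z for z, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    slope = (sum((z - mz) * (y - my) for z, y in pts)
             / sum((z - mz) ** 2 for z, _ in pts))
    intercept = my - slope * mz
    return intercept, slope
```

Fitting only the central part of the plot keeps the estimates from being dominated by the tails, where array data often departs from the straight line.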

The method according to any one of claims 1 to 4, further comprising:
rearranging the data values in the order of the spots arranged on the chip, and temporarily storing them in the storage means in that order;
calculating, for the columns or rows in which the spots are arranged on the chip, an index indicating the tendency of the data values of each column or row;
calculating, when the index shows that individual columns or rows have distinctive features, the median of the data values of each column or each row; and
dividing each data value by the corresponding median to obtain division values, and temporarily storing them in the storage means,
wherein the temporarily stored division values are used as the calculation targets, as values corresponding to the data values of the array data.

The method according to claim 5, wherein the step of calculating the index indicating the tendency comprises calculating a moving average for a specific column or row.
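The row-wise correction of the two preceding claims can be sketched as follows. The moving average serves as the trend index and each value is divided by the median of its row. How the index is used to decide whether a correction is needed is not specified in the claims, so that decision is left to the caller. The function names and the list-of-rows layout are assumptions.

```python
import statistics


def moving_average(values, window):
    """Simple moving average over consecutive spots of one row or column."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def divide_by_row_median(grid):
    """Divide every value by the median of its row (grid is a list of rows)."""
    corrected = []
    for row in grid:
        med = statistics.median(row)
        corrected.append([v / med for v in row])
    return corrected
```

For column-wise correction the same function can be applied to the transposed grid, for example `list(zip(*grid))`.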

The method according to any one of claims 1 to 6, further comprising:
rearranging the data values in the order of the spots arranged on the chip, and temporarily storing them in the storage means in that order;
finding periodicity of the data values in that order; and
when periodicity is present, subtracting the characteristic value of central tendency of the period from each data value to calculate subtraction values, and temporarily storing them in the storage means,
wherein the temporarily stored subtraction values are used as the calculation targets, as values corresponding to the data values of the array data.
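One possible reading of the periodicity correction is sketched below. The period is taken as the lag with the highest autocorrelation, and the "characteristic value of central tendency of the period" is interpreted as the median of all values sharing the same position within the period. Both choices are assumptions, since the claim names neither a detection method nor an estimator, and the function names are hypothetical.

```python
import statistics


def best_period(values, max_lag):
    """Return the lag in 1..max_lag with the highest autocorrelation."""
    m = statistics.fmean(values)
    dev = [v - m for v in values]
    denom = sum(d * d for d in dev)

    def autocorr(lag):
        return sum(dev[i] * dev[i + lag] for i in range(len(dev) - lag)) / denom

    return max(range(1, max_lag + 1), key=autocorr)


def remove_periodic(values, period):
    """Subtract, from each value, the median of its position within the period."""
    phase_median = [statistics.median(values[k::period]) for k in range(period)]
    return [v - phase_median[i % period] for i, v in enumerate(values)]
```

A strictly repeating series such as 0, 5, 1, 0, 5, 1, ... is reduced to zeros by `remove_periodic(values, 3)`, leaving only the variation that is not explained by the position within the period.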

The method according to any one of claims 1 to 4, further comprising:
rearranging the data values in the order of the spots arranged on the chip;
calculating, for each column or row in which the spots are arranged on the chip, a characteristic value of central tendency of the data values;
setting, based on the characteristic value of central tendency, a background value for the spots belonging to the column or row, and subtracting the background value from each of the data values of those spots to calculate subtraction values;
taking the logarithm of each subtraction value to obtain logarithmic values; and
subtracting, for the column or row, the characteristic value of central tendency of the logarithmic values, and temporarily storing the resulting subtraction values in the storage means,
wherein the temporarily stored subtraction values are used as the calculation targets, as values corresponding to the data values of the array data.

A method of processing gene expression data, in which array data obtained based on gene expression levels is processed to obtain analyzable data, the method comprising:
calculating, for each column or row in which spots are arranged on the chip, a characteristic value of central tendency of the data values;
setting, based on the characteristic value of central tendency, candidates for a background value for the spots belonging to the column or row, and subtracting the background candidate value from each of the data values of those spots to calculate subtraction values;
taking the logarithm of each subtraction value to obtain logarithmic values;
calculating, for the column or row, a characteristic value of central tendency of the logarithmic values, and subtracting it from each of the logarithmic values to calculate second subtraction values;
dividing, for the column or row, the data values by a characteristic value of variation calculated from the second subtraction values to obtain division values, and temporarily storing them in storage means;
comparing the division values with the corresponding standard values, and determining, as the background value γ, the background candidate value that minimizes an index of the difference between them; and
storing in the storage means the background value γ, and the characteristic value μ of central tendency and the characteristic value σ of variation associated with that background value γ.

A method of processing gene expression data, in which array data obtained based on gene expression levels is processed to obtain analyzable data, the method comprising:
acquiring the array data, sorting the data values of the acquired array data, and temporarily storing the sorted data in storage means;
calculating standard values of a normal distribution corresponding to the sorted data values;
setting a characteristic value s of variation for the data values, storing it in the storage means, and multiplying each of the standard values by it to obtain multiplication values;
comparing the data values with the multiplication values to determine a range in which the ratio between the two remains constant;
calculating the slope and y-intercept of the straight line formed in that range when the multiplication values are taken on the x-axis and the logarithmic values on the y-axis; and
determining the natural logarithm of the slope as the characteristic value u of central tendency and the intercept as the background value g, and storing them in the storage means.

The method according to claim 10, further comprising:
solving for x_i using
x_i = (10^u) * (10^(s * Z_i)) + g
(where Z_i is the i-th standard value), and temporarily storing it in the storage means; and
obtaining a lower limit of the values usable as x_i, and storing the lower limit in the storage means.
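The model equation in this claim can be evaluated directly once u, s and g are known. The sketch below assumes rank-based normal scores for Z_i and uses the hypothetical name `model_values`. It does not implement the claim's lower-limit step, whose rule is not stated here. It only reflects that every model value exceeds g, because the exponential term is positive.

```python
from statistics import NormalDist


def model_values(u, s, g, n):
    """Evaluate x_i = 10**u * 10**(s * Z_i) + g for n rank-based normal scores."""
    nd = NormalDist()
    zs = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]  # assumed plotting position
    return [(10 ** u) * (10 ** (s * z)) + g for z in zs]


# Example: three model values for u = 2, s = 0.5, g = 10.
xs = model_values(2.0, 0.5, 10.0, 3)
```

With an odd n the middle score is zero, so the middle model value equals 10^u + g.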

A computer-readable program that operates a computer to process array data obtained based on gene expression levels and obtain analyzable data, the program causing the computer to execute:
acquiring the array data, sorting the data values of the acquired array data, extracting a predetermined number of data values at predetermined intervals from the sorted data values, and temporarily storing them in storage means;
selecting a plurality of background candidates and temporarily storing them in the storage means;
subtracting the value of each background candidate from each of the extracted data values to obtain subtraction values, logarithmically transforming each subtraction value to obtain logarithmic values, and temporarily storing the logarithmic values in the storage means;
calculating a standard value of a normal distribution corresponding to each of the logarithmic values;
calculating, for each background candidate, an index indicating the difference between the logarithmic values and the standard values;
narrowing the range of background candidate values based on the index;
determining the background value by repeating the acquisition of the subtraction values and logarithmic values, the calculation of the index indicating the difference, and the narrowing of the background candidate values; and
standardizing each of the logarithmic values temporarily stored in association with the determined background value, and storing each standardized value in the storage means.

A computer-readable program that operates a computer to process array data obtained based on gene expression levels and obtain analyzable data, the program causing the computer to execute:
acquiring the array data, sorting the data values of the acquired array data, extracting a predetermined number of data values at predetermined intervals from the sorted data values, and temporarily storing them in storage means;
determining a background value γ and storing it in the storage means;
taking the logarithm of each subtraction value, that is, each data value with the background value subtracted, to obtain logarithmic values, and temporarily storing them in the storage means;
calculating a characteristic value μ of central tendency and a characteristic value σ of variation with reference to the logarithmic values, and storing them in the storage means; and
for each data value x, calculating z = (log(x − γ) − μ) / σ as a standard value z, and storing each calculated standard value z in the storage means.

The program according to claim 13, wherein, in the step of determining the background value γ, the program causes the computer to execute:
selecting a plurality of background candidates and temporarily storing them in the storage means;
subtracting the value of each background candidate from each of the extracted data values to obtain subtraction values, logarithmically transforming each subtraction value to obtain logarithmic values, and temporarily storing the logarithmic values in the storage means;
calculating a standard value of a normal distribution corresponding to each of the logarithmic values;
calculating, for each background candidate, an index indicating the difference between the logarithmic values and the standard values; and
narrowing the range of background candidate values based on the index,
and operates the computer so that the background value is determined by repeating the acquisition of the subtraction values and logarithmic values, the calculation of the index indicating the difference, and the narrowing of the background candidate values.

The program according to claim 13 or 14, wherein, in the step of obtaining the characteristic value μ of central tendency and the characteristic value σ of variation, the program causes the computer to execute:
calculating a standard value corresponding to each of the logarithmic values;
comparing the logarithmic values with the standard values to determine a range in which the ratio between the two remains substantially constant;
calculating the slope and y-intercept of the straight line formed in that range when the standard values are taken on the x-axis and the logarithmic values on the y-axis; and
determining the calculated y-intercept as the characteristic value μ of central tendency and the slope as the characteristic value σ of variation.

The program according to any one of claims 12 to 15, further causing the computer to execute:
rearranging the data values in the order of the spots arranged on the chip, and temporarily storing them in the storage means in that order;
calculating, for the columns or rows in which the spots are arranged on the chip, an index indicating the tendency of the data values of each column or row;
calculating, when the index shows that individual columns or rows have distinctive features, the median of the data values of each column or each row; and
dividing each data value by the corresponding median to obtain division values, and temporarily storing them in the storage means,
wherein the temporarily stored division values are used as the calculation targets, as values corresponding to the data values of the array data.

The method according to claim 16, wherein, in the step of calculating the index indicating the tendency, the computer is caused to execute a step of calculating a moving average for a specific column or row.

The program according to any one of claims 12 to 17, further causing the computer to execute:
rearranging the data values in the order of the spots arranged on the chip, and temporarily storing them in the storage means in that order;
finding periodicity of the data values in that order; and
when periodicity is present, subtracting the characteristic value of central tendency of the period from each data value to calculate subtraction values, and temporarily storing them in the storage means,
wherein the temporarily stored subtraction values are used as the calculation targets, as values corresponding to the data values of the array data.

The program according to any one of claims 12 to 15, further causing the computer to execute:
rearranging the data values in the order of the spots arranged on the chip;
calculating, for each column or row in which the spots are arranged on the chip, a characteristic value of central tendency of the data values;
setting, based on the characteristic value of central tendency, a background value for the spots belonging to the column or row, and subtracting the background value from each of the data values of those spots to calculate subtraction values;
taking the logarithm of each subtraction value to obtain logarithmic values; and
subtracting, for the column or row, the characteristic value of central tendency of the logarithmic values, and temporarily storing the resulting subtraction values in the storage means,
wherein the temporarily stored subtraction values are used as the calculation targets, as values corresponding to the data values of the array data.

A computer-readable program that operates a computer to process array data obtained based on gene expression levels and obtain analyzable data, the program causing the computer to execute:
calculating, for each column or row in which spots are arranged on the chip, a characteristic value of central tendency of the data values;
setting, based on the characteristic value of central tendency, candidates for a background value for the spots belonging to the column or row, and subtracting the background candidate value from each of the data values of those spots to calculate subtraction values;
taking the logarithm of each subtraction value to obtain logarithmic values;
calculating, for the column or row, a characteristic value of central tendency of the logarithmic values, and subtracting it from each of the logarithmic values to calculate second subtraction values;
dividing, for the column or row, the data values by a characteristic value of variation calculated from the second subtraction values to obtain division values, and temporarily storing them in storage means;
comparing the division values with the corresponding standard values, and determining, as the background value γ, the background candidate value that minimizes an index of the difference between them; and
storing in the storage means the background value γ, and the characteristic value μ of central tendency and the characteristic value σ of variation associated with that background value γ.

A computer-readable program that operates a computer to process array data obtained based on gene expression levels and obtain analyzable data, the program causing the computer to execute:
acquiring the array data, sorting the data values of the acquired array data, and temporarily storing the sorted data in storage means;
calculating standard values of a normal distribution corresponding to the sorted data values;
setting a characteristic value s of variation for the data values, storing it in the storage means, and multiplying each of the standard values by it to obtain multiplication values;
comparing the data values with the multiplication values to determine a range in which the ratio between the two remains constant;
calculating the slope and y-intercept of the straight line formed in that range when the multiplication values are taken on the x-axis and the logarithmic values on the y-axis; and
determining the natural logarithm of the slope as the characteristic value u of central tendency and the intercept as the background value g, and storing them in the storage means.

The program according to claim 21, further causing the computer to execute:
solving for x_i using
x_i = (10^u) * (10^(s * Z_i)) + g
(where Z_i is the i-th standard value), and temporarily storing it in the storage means; and
obtaining a lower limit of the values usable as x_i, and storing the lower limit in the storage means.