JP2008158748A

JP2008158748A - Variable selection device and method, and program

Info

Publication number: JP2008158748A
Application number: JP2006345996A
Authority: JP
Inventors: Yoichi Kitahara (北原洋一); Kentaro Torii (鳥居健太郎); Ryohei Orihara (折原良平)
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2006-12-22
Filing date: 2006-12-22
Publication date: 2008-07-10

Abstract

PROBLEM TO BE SOLVED: To construct a model at high speed even when the number of samples and the number of explanatory variable candidates are large.

SOLUTION: In this variable selection method, explanatory variables used to generate a model for calculating the probability that a prescribed event occurs or does not occur are selected by use of a set of samples, each including a plurality of explanatory variables each having a first value or a second value, and an objective variable in which the presence or absence of the prescribed event is expressed by the first value or the second value. The number of samples in which the objective variable has the first value is counted as a first frequency, and the number of samples in which it has the second value is counted as a second frequency. For each explanatory variable, the number of samples in which the explanatory variable has the first value and the objective variable has the first value is counted as a third frequency, and the number of samples in which the explanatory variable has the first value and the objective variable has the second value is counted as a fourth frequency. A feature amount of each explanatory variable is calculated using the first frequency, the second frequency, and the third and fourth frequencies obtained for that explanatory variable, and one or more explanatory variables are selected based on the calculated feature amounts.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a variable selection device, method, and program for selecting explanatory variables used to construct a model that calculates, for example, the probability that a specific variable takes a specific value.

A model that describes a phenomenon is used to analyze its factors or to predict future phenomena. For example, in preventive medicine, a logistic regression model in which the presence or absence of a disease is the objective variable and interview data and examination data are the explanatory variables is often used to express the relationship between disease risk and disease factors. In machine failure prediction, a logistic regression model in which the presence or absence of a failure is the objective variable and parameters indicating the state of the machine are the explanatory variables can be used to express the relationship between the risk of abnormal operation and failure factors, which makes the machine easier to manage. In Patent Document 1, a discomfort probability used in an image forming apparatus is expressed by a logistic regression model.

Conventionally, when there are many explanatory variable candidates, the coefficients of the explanatory variables have been estimated by an iterative convergence method such as the Newton-Raphson method, and variables have been selected based on an information criterion such as AIC, as described in Non-Patent Document 1.
Patent Document 1: JP 2004-219976 A
Non-Patent Document 1: Toshiro Tango, Kazue Yamaoka, Haruyoshi Takagi, Logistic Regression Analysis: Practical Statistical Analysis Using SAS, Asakura Shoten, 1996

A conventional device that uses an iterative convergence method for estimation can construct a highly accurate model, but variable selection takes a great deal of time when there are many explanatory variable candidates. In the medical field, for example, when a disease is suspected, many tests are performed and the possibility of the disease is considered from many viewpoints, so there may be many explanatory variable candidates. In failure diagnosis, likewise, many inspections are performed and the possibility of abnormal operation is considered from many viewpoints, so there may be many explanatory variable candidates. In information retrieval and text mining, the many words and phrases appearing in a text are the subject of analysis, so there may again be many explanatory variable candidates. In such cases, variable selection with the conventional iterative convergence method takes a great deal of time.

The present invention provides a variable selection device, method, and program that make it possible to construct a model at high speed even when the number of samples and the number of explanatory variable candidates are large.

A variable selection device as one aspect of the present invention is a variable selection device that selects explanatory variables used to generate a model for calculating the probability that a predetermined event occurs or does not occur, using a set of samples each including a plurality of explanatory variables having a first value or a second value and an objective variable that represents, by the first value and the second value, whether the predetermined event has occurred, the device comprising:
a frequency counting unit that counts the number of samples in which the objective variable has the first value as a first frequency and the number of samples in which the objective variable has the second value as a second frequency, and that counts, for each explanatory variable, the number of samples in which the explanatory variable has the first value and the objective variable has the first value as a third frequency and the number of samples in which the explanatory variable has the first value and the objective variable has the second value as a fourth frequency;
a feature amount calculation unit that calculates a feature amount of each explanatory variable using the first frequency, the second frequency, the third frequency obtained for each explanatory variable, and the fourth frequency obtained for each explanatory variable; and
a variable selection unit that selects one or more explanatory variables based on the calculated feature amounts.

A variable selection method as one aspect of the present invention is a variable selection method for selecting explanatory variables used to generate a model for calculating the probability that a predetermined event occurs or does not occur, using a set of samples each including a plurality of explanatory variables having a first value or a second value and an objective variable that represents, by the first value and the second value, whether the predetermined event has occurred, the method comprising:
counting the number of samples in which the objective variable has the first value as a first frequency and the number of samples in which the objective variable has the second value as a second frequency;
counting, for each explanatory variable, the number of samples in which the explanatory variable has the first value and the objective variable has the first value as a third frequency and the number of samples in which the explanatory variable has the first value and the objective variable has the second value as a fourth frequency;
calculating a feature amount of each explanatory variable using the first frequency, the second frequency, the third frequency obtained for each explanatory variable, and the fourth frequency obtained for each explanatory variable; and
selecting one or more explanatory variables based on the calculated feature amounts.

A program as one aspect of the present invention is a program for selecting explanatory variables used to generate a model for calculating the probability that a predetermined event occurs or does not occur, using a set of samples each including a plurality of explanatory variables having a first value or a second value and an objective variable that represents, by the first value and the second value, whether the predetermined event has occurred, the program causing a computer to execute the steps of:
counting the number of samples in which the objective variable has the first value as a first frequency and the number of samples in which the objective variable has the second value as a second frequency;
counting, for each explanatory variable, the number of samples in which the explanatory variable has the first value and the objective variable has the first value as a third frequency and the number of samples in which the explanatory variable has the first value and the objective variable has the second value as a fourth frequency;
calculating a feature amount of each explanatory variable using the first frequency, the second frequency, the third frequency obtained for each explanatory variable, and the fourth frequency obtained for each explanatory variable; and
selecting one or more explanatory variables based on the calculated feature amounts.

According to the present invention, a model can be constructed at high speed even when the number of samples and the number of explanatory variable candidates are large.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

First, terms used in the embodiment of the present invention will be described.

FIG. 2 shows, in table form, an example of data (a set of initial samples) consisting of a plurality of samples to be analyzed. Each sample has an identifier, and the value of each item is shown at the intersection of the identifier's row and the item's column. For example, the risk value of identifier 001 is 0.

For data such as that shown in FIG. 2, an expression that approximates the value of a specific item of interest using the values of the other items is called a model. In a model, the variable corresponding to the item of interest is called the objective variable, and the variables corresponding to the items used to describe the objective variable are called explanatory variables. In the present embodiment, the objective variable can take only two values. In the following description, the objective variable is assumed to take only 1 or 0.

The logistic regression model is expressed by Equation 1, where p is the probability that the objective variable is 1, x_k is the k-th explanatory variable, and β_k is the coefficient of the explanatory variable x_k.

Here, k is a number that uniquely identifies an explanatory variable. For example, if the risk item in FIG. 2 is the objective variable, p represents the probability that the value of the risk item is 1. If the explanatory variables are numbered by assigning numbers to the items from the leftmost column, skipping the item corresponding to the objective variable, the explanatory variable corresponding to the sleep time item is x_1, and the coefficient corresponding to x_1 is β_1.

The quantity p/(1 − p) is the ratio of the probability that the objective variable is 1 to the probability that it is 0, and is called the odds.
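The image of Equation 1 is not reproduced in this text. Under the definitions given here, a logistic regression model in its standard textbook form would read as sketched below; the intercept β_0 and the upper index K (the number of explanatory variables) are assumptions that do not appear in the surrounding description.

```latex
\log\frac{p}{1-p} \;=\; \beta_0 + \sum_{k=1}^{K} \beta_k x_k
\qquad\Longleftrightarrow\qquad
p \;=\; \frac{1}{1+\exp\!\left(-\beta_0-\sum_{k=1}^{K}\beta_k x_k\right)}
```

In this form the left-hand side is the logarithm of the odds p/(1 − p), so each β_k is the change in log odds when x_k changes from 0 to 1.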

The logistic regression model can be used widely as long as the objective variable is binary and the explanatory variables are expressed as numerical values. For example, if the objective variable is the presence or absence of a disease and the explanatory variables are disease factor candidates, the model expresses the relationship between the probability of onset and the disease factors. Because the coefficients of the explanatory variables indicate how strongly each factor affects the disease, the disease can be prevented by eliminating the factors that affect it. In addition, the probability that the objective variable is 1 gives the probability that a specific person will contract the disease, which makes it possible to find people who are likely to contract it. Similarly, if the objective variable is the presence or absence of a machine failure and the explanatory variables are the state of the machine, the model can be used to find machines that are likely to fail or to identify the causes of failure.

A model should be stably accurate and should be constructible within practically acceptable limits. For example, a model that is highly accurate on part of the data but is affected by transient noise cannot be applied to other similar data, and is therefore not stably accurate. Moreover, even a highly accurate model must be constructed in a practically acceptable time. It is generally known that constructing a stably accurate model in a short time requires using only the minimum set of explanatory variables needed to describe the phenomenon. For the model of Equation 1, accurately describing the probability p that the objective variable is 1 requires selecting appropriate explanatory variables from among what are generally many items, and determining the coefficients of the selected explanatory variables, in a short time.

FIG. 1 is a basic configuration diagram of the variable selection device according to the present embodiment.

The input unit 101 inputs the data (a set of initial samples) used for model construction. The input unit 101 can be chosen to suit the application, and may be a keyboard, a mouse, a pen-type input device, or a sensor.

The data storage unit 102 stores the data input from the input unit 101. The data storage unit 102 may instead hold, in advance, the data that would otherwise be input from the input unit 101, in which case the input unit 101 may be omitted. The storage format can be chosen to suit the application: the data may be stored in a database, as text data, or in a format specific to a particular application. To make the processing in the frequency counting unit described later easier, the data is preferably sorted by the value of the objective variable.

When the data to be analyzed includes data that is not a binary variable, the variable conversion unit 103 performs a binary conversion process that converts it into binary variables according to a conversion rule given in advance. Non-binary data includes categorical variables, which take discrete values, and continuous variables, which take continuous values. Such variables are converted into variables that take only two values, such as 0 and 1. Any conversion method suited to the application may be chosen as long as it does not significantly lose the information present before conversion; a general quantification method may be used, or a method that converts using a threshold based on prior information. The data to be analyzed, after the binary conversion process, is sent to the frequency counting unit 104. Alternatively, the converted data may be stored in the data storage unit 102, and the frequency counting unit 104 may acquire it by accessing the data storage unit 102. If all of the data to be analyzed already consists of binary variables, the variable conversion unit 103 may be omitted.

FIG. 3 shows an example in which the non-binary data of FIG. 2 has been converted into binary variables by quantifying the categorical variables and discretizing the continuous variables. The sleep time item in FIG. 2 is a continuous variable, and is converted by discretization into a binary variable that takes 0 or 1. Here, values of 7 or more are set to 1 and values less than 7 are set to 0. For example, the sleep time item of identifier 001 in FIG. 2 is 9, so after conversion the sleep time item of identifier 001 in FIG. 3 is 1. The discretization method is not limited to this, and any appropriate method may be adopted.

The sleep state item in FIG. 2 is a categorical variable that can take three values (good, medium, or bad), and is converted by quantification into binary variables that take 0 or 1. Specifically, the sleep state item is split into two items, sleep state 1 and sleep state 2, and every state the categorical variable can take is represented by a combination of the values of these two items. In the example of FIG. 3, "good" is represented by sleep state 1 = 0 and sleep state 2 = 0, "medium" by sleep state 1 = 1 and sleep state 2 = 0, and "bad" by sleep state 1 = 0 and sleep state 2 = 1. The quantification method is not limited to this, and any method appropriate to the purpose may be adopted.
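The two conversions described above can be sketched as follows. This is a minimal illustration only: the function names and the dictionary are hypothetical, while the threshold of 7 and the two-item coding of the sleep state follow the description of FIG. 3.

```python
# Hypothetical coding table: category -> (sleep state 1, sleep state 2),
# following the example described for FIG. 3.
SLEEP_STATE_CODES = {
    "good": (0, 0),
    "medium": (1, 0),
    "bad": (0, 1),
}


def binarize_sleep_time(hours, threshold=7):
    """Discretize a continuous value: 1 if hours >= threshold, else 0."""
    return 1 if hours >= threshold else 0


def binarize_sleep_state(state):
    """Quantify a three-level category as a pair of binary items."""
    return SLEEP_STATE_CODES[state]


# Identifier 001 has a sleep time of 9 in FIG. 2, which becomes 1 in FIG. 3.
print(binarize_sleep_time(9))
print(binarize_sleep_state("bad"))
```

With this coding, a variable with three categories needs only two binary items, because the all-zero combination serves as the third category.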

As shown in FIG. 4, the frequency counting unit 104 includes a variable non-selection frequency counting unit 201, a variable non-selection odds calculation unit 202, a variable selection frequency counting unit 203, and a variable selection odds calculation unit 204.

The variable non-selection frequency counting unit 201 counts the frequency with which the objective variable is 0 and the frequency with which the objective variable is 1. The former is the number of samples whose objective variable is 0, and the latter is the number of samples whose objective variable is 1. In FIG. 3, for example, the risk item, which corresponds to the objective variable, is 0 for the three identifiers 001, 005, and 006 and is 1 for the four identifiers 002, 003, 004, and 007. The frequency with which the objective variable is 0 is therefore 3, and the frequency with which it is 1 is 4. Adding the two frequencies gives the total number of samples, 7.

The variable non-selection odds calculation unit 202 calculates the odds for all samples from the frequencies counted by the variable non-selection frequency counting unit 201. The odds for all samples is the ratio of the frequency with which the objective variable is 1 to the frequency with which it is 0. In FIG. 3, for example, the frequency with which the objective variable is 0 is 3 and the frequency with which it is 1 is 4, so the odds for all samples is 4/3.
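The counting and odds calculation for all samples can be sketched as below. The risk values are those stated for FIG. 3; the function and variable names are hypothetical, and an exact fraction is used only so that the result 4/3 is visible without rounding.

```python
from fractions import Fraction

# Risk item (objective variable) per identifier, as described for FIG. 3.
risk = {"001": 0, "002": 1, "003": 1, "004": 1, "005": 0, "006": 0, "007": 1}


def count_objective(values):
    """Return (count of samples with objective 0, count with objective 1)."""
    values = list(values)
    n0 = sum(1 for v in values if v == 0)
    n1 = sum(1 for v in values if v == 1)
    return n0, n1


n0, n1 = count_objective(risk.values())
total = n0 + n1                  # total number of samples, N
overall_odds = Fraction(n1, n0)  # odds for all samples, R
print(n0, n1, total, overall_odds)
```

A single pass over the objective column is enough, which is why this step stays cheap even for a large number of samples.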

The variable selection frequency counting unit 203 counts, for an explanatory variable, the frequency with which the objective variable is 0 when the explanatory variable is 1, and the frequency with which the objective variable is 1 when the explanatory variable is 1. The former is the number of samples in which the explanatory variable is 1 and the objective variable is 0, and the latter is the number of samples in which the explanatory variable is 1 and the objective variable is 1. Taking the sleep time item as the explanatory variable, in FIG. 3 the samples whose sleep time item is 1 and whose risk item is 0 are the two identifiers 001 and 006, and the samples whose sleep time item is 1 and whose risk item is 1 are the three identifiers 002, 003, and 004. The frequency with which the objective variable is 0 when the explanatory variable is 1 is therefore 2, and the frequency with which the objective variable is 1 when the explanatory variable is 1 is 3. Adding the two frequencies gives 5, the number of samples in which the explanatory variable is 1.

The variable selection odds calculation unit 204 calculates the odds when the explanatory variable is 1 from the frequencies counted by the variable selection frequency counting unit 203. The odds when the explanatory variable is 1 is the ratio of the frequency with which the objective variable is 1 when the explanatory variable is 1 to the frequency with which the objective variable is 0 when the explanatory variable is 1. Taking the sleep time item in FIG. 3 as the explanatory variable, the frequency with which the objective variable is 0 when the explanatory variable is 1 is 2, and the frequency with which the objective variable is 1 when the explanatory variable is 1 is 3, so the odds when the explanatory variable is 1 is 3/2.
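The per-variable counting can be sketched in the same way. The (sleep time, risk) pairs below are reconstructed from the identifiers listed in the description of FIG. 3; the names are hypothetical. Note that the odds are undefined when no sample has the explanatory variable at 1 and the objective variable at 0, a case this sketch does not handle.

```python
from fractions import Fraction

# (sleep time, risk) per identifier, reconstructed from the counts in the text.
samples = {
    "001": (1, 0), "002": (1, 1), "003": (1, 1), "004": (1, 1),
    "005": (0, 0), "006": (1, 0), "007": (0, 1),
}


def count_given_x1(pairs):
    """Count samples with x == 1 split by the objective value (y == 0, y == 1)."""
    pairs = list(pairs)
    n_x1_y0 = sum(1 for x, y in pairs if x == 1 and y == 0)
    n_x1_y1 = sum(1 for x, y in pairs if x == 1 and y == 1)
    return n_x1_y0, n_x1_y1


n_x1_y0, n_x1_y1 = count_given_x1(samples.values())
n_k = n_x1_y0 + n_x1_y1            # number of samples with x_k == 1, N_k
odds_k = Fraction(n_x1_y1, n_x1_y0)  # odds when x_k == 1, R_k
print(n_x1_y0, n_x1_y1, n_k, odds_k)
```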

The frequency counting unit 104 sends the frequencies and odds calculated in this way to the coefficient estimation unit 105 in the next stage. Alternatively, the frequencies and odds may be stored in the data storage unit 102, and the coefficient estimation unit 105 may acquire them by accessing the data storage unit 102.

The coefficient estimation unit 105 estimates the coefficient of each explanatory variable using the frequencies and odds calculated by the frequency counting unit 104. The estimated coefficients are sent to the feature amount calculation unit 106. Each coefficient is calculated with an expression that approximates the coefficient of the explanatory variable, for example Equation 2:

B_k = log( (N − N_k) × R_k / (N × R − N_k × R_k) )    (Equation 2)

Equation 2 approximates the coefficients of the logistic regression model using frequencies and odds, and was devised on the basis of the inventors' own research. B_k is the coefficient of the k-th explanatory variable, N is the total number of samples, R is the odds for all samples, N_k is the number of samples in which the k-th explanatory variable is 1, and R_k is the odds when the k-th explanatory variable is 1.

In the example of FIG. 3, N is 7 and R is 4/3, as described above. For the explanatory variable corresponding to the sleep time item, N_k is 5 and R_k is 3/2. The coefficient of this explanatory variable is therefore calculated as log((7 − 5) × (3/2) / (7 × (4/3) − 5 × (3/2))) ≈ 0.21. FIG. 5 shows a table of the coefficients estimated in the same way for the other explanatory variables.
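The worked calculation above can be checked with a short sketch. The function name and the `log` parameter are hypothetical. The text does not state the base of the logarithm; the quoted value of about 0.21 is reproduced with a base-10 logarithm, whereas a natural logarithm gives about 0.49 for the same inputs, so base 10 is used as the default here purely to match the example.

```python
import math


def estimate_coefficient(n, r, n_k, r_k, log=math.log10):
    """Closed-form coefficient estimate following the worked example:
    log((N - N_k) * R_k / (N * R - N_k * R_k)).

    n   : total number of samples (N)
    r   : odds for all samples (R)
    n_k : number of samples with the k-th explanatory variable equal to 1 (N_k)
    r_k : odds when the k-th explanatory variable is 1 (R_k)
    log : logarithm function; base 10 is an assumption chosen to match 0.21
    """
    numerator = (n - n_k) * r_k
    denominator = n * r - n_k * r_k
    return log(numerator / denominator)


b_sleep = estimate_coefficient(n=7, r=4 / 3, n_k=5, r_k=3 / 2)
print(round(b_sleep, 2))  # 0.21
```

Because only four counts per variable enter the expression, the estimate needs no iteration, in contrast to a Newton-Raphson fit.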

The expression used to estimate the coefficients need not be Equation 2, as long as it approximates the coefficients of the explanatory variables from the frequencies and odds calculated by the frequency counting unit 104 without using an iterative convergence method. Avoiding the iterative convergence method makes it possible to estimate the coefficients in a short time.

The feature amount calculation unit 106 uses the frequencies and odds calculated by the frequency counting unit 104 to calculate, for each explanatory variable, a feature amount that serves as the criterion for selecting explanatory variables. The coefficient of each explanatory variable received from the coefficient estimation unit 105 may also be used in calculating the feature amount. The feature amount calculation unit 106 sends the calculated feature amount of each explanatory variable, together with the coefficient of each explanatory variable received from the coefficient estimation unit 105, to the variable selection unit 107.

As the feature amount, for example, the confidence interval (range) of the coefficient that corresponds to the confidence interval of the frequency with which the explanatory variable is 1 and the objective variable is 1 is calculated, and the minimum absolute value of the coefficient within that confidence interval is adopted. As for the calculation of the confidence interval, an approximate confidence interval based on the normal distribution may be used if the number of samples is sufficient, and Fisher's exact confidence interval based on the F distribution may be used if a more accurate value is desired. The confidence level may be the commonly used 95%, or another level chosen to suit the problem. A specific example is described in detail below.
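The final step of this feature amount, taking the minimum absolute value of the coefficient over its confidence interval, can be sketched as follows. How the coefficient interval itself is obtained is not covered by this sketch; the function name and the example bounds are hypothetical.

```python
def min_abs_in_interval(lower, upper):
    """Smallest absolute value attained on the closed interval [lower, upper]."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if lower <= 0.0 <= upper:
        # The interval contains 0, so the coefficient may have no effect at all.
        return 0.0
    return min(abs(lower), abs(upper))


# Illustrative bounds only; they are not taken from FIG. 7.
print(min_abs_in_interval(0.15, 0.40))
print(min_abs_in_interval(-0.10, 0.30))
```

Under this reading, a variable whose coefficient interval straddles zero receives a feature amount of 0, so only variables whose effect is bounded away from zero score highly.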

Assume that the data shown in FIG. 6 has been obtained from data to be analyzed (not shown) consisting of a stomach pain item, a chest pain item, a headache item, a fatigue item, and a risk item, and consider calculating the feature amount of each explanatory variable (stomach pain, chest pain, headache, fatigue). The total number of samples in the data to be analyzed is 10,000, and the odds is 1.2 (= number of samples whose objective variable is 1 / number of samples whose objective variable is 0). In the table of FIG. 6, "number of samples" is the number of samples, out of all samples, in which the explanatory variable is 1, and "frequency of value 1" is the number of samples in which the explanatory variable is 1 and the objective variable (risk item) is 1. For the explanatory variable for chest pain, for example, the number of samples in which the explanatory variable is 1 is 7,000, and the frequency with which the explanatory variable is 1 and the objective variable is 1 is 4,000. FIG. 7 shows the feature amount calculated for each explanatory variable, together with the data obtained in the course of calculating it. An example of calculating the feature amount for each explanatory variable is described below with reference to FIG. 7.

First, a confidence interval of the population proportion (first confidence interval) is calculated for the ratio of the frequency at which the explanatory variable is 1 and the objective variable is 1 to the number of samples in which the explanatory variable is 1. The confidence level can be set appropriately for the problem; here a confidence level of 95% is used as an example. Methods for calculating the confidence interval of a population proportion include one based on the normal approximation and one based on the F distribution, and an appropriate method can be adopted depending on the problem. If the number of samples is large, the normal approximation is sufficient; if a more accurate confidence interval is desired, the method based on the F distribution may be adopted. For the chest pain explanatory variable, for example, the ratio of the 4,000 samples in which the explanatory variable is 1 and the objective variable is 1 to the 7,000 samples in which the explanatory variable is 1 is 4,000/7,000. Calculating the confidence interval of the population proportion by the F distribution method at a confidence level of 95% gives a lower limit of 0.5597365 and an upper limit of 0.5830611.
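The first confidence interval can be sketched as follows. This is a minimal illustration that uses the normal approximation, which the description permits for large samples; it is not the F distribution method that produced the quoted limits, so the results agree only to about three decimal places. The function name `proportion_ci_normal` is hypothetical.

```python
from statistics import NormalDist


def proportion_ci_normal(successes, n, confidence=0.95):
    """Approximate confidence interval for a population proportion.

    Normal approximation only; the description's quoted limits come
    from the F distribution (exact) method instead.
    """
    p_hat = successes / n
    # Two-sided critical value, about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * (p_hat * (1.0 - p_hat) / n) ** 0.5
    return p_hat - half_width, p_hat + half_width


# Chest pain: 4,000 of the 7,000 samples with the variable set to 1.
lower, upper = proportion_ci_normal(4000, 7000)
print(f"{lower:.4f} {upper:.4f}")
```

With 7,000 samples the approximate limits differ from the exact limits 0.5597365 and 0.5830611 by less than 0.001.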

Next, the confidence interval of the population proportion is used to calculate a confidence interval (second confidence interval) for the population frequency at which the explanatory variable is 1 and the objective variable is 1. Multiplying the lower and upper limits of the confidence interval of the population proportion by the number of samples in which the explanatory variable is 1 gives this interval. For the chest pain explanatory variable, for example, the number of samples is 7,000, and the confidence interval of the population proportion has a lower limit of 0.5597365 and an upper limit of 0.5830611. Multiplying each by 7,000 gives a lower limit of 3918.1555 and an upper limit of 4081.4277 for the population frequency at which the explanatory variable is 1 and the objective variable is 1.

Next, from the lower and upper limits of the confidence interval for the population frequency at which the explanatory variable is 1 and the objective variable is 1, the corresponding odds for the case where the explanatory variable is 1 are calculated. First, the lower and upper limits of the frequency at which the explanatory variable is 1 and the objective variable is 1 are each subtracted from the number of samples in which the explanatory variable is 1. This gives the confidence interval (third confidence interval) of the frequency at which the explanatory variable is 1 and the objective variable is 0. For the chest pain explanatory variable, for example, the number of samples is 7,000 and, as described above, the frequency at which the explanatory variable is 1 and the objective variable is 1 has a lower limit of 3918.1555 and an upper limit of 4081.4277. The frequency at which the explanatory variable is 1 and the objective variable is 0 therefore has an upper limit of 3081.8445 and a lower limit of 2918.5723. Then the lower limit 3918.1555 and the upper limit 4081.4277 of the frequency at which the explanatory variable is 1 and the objective variable is 1 are divided by the corresponding frequencies 3081.8445 and 2918.5723 at which the explanatory variable is 1 and the objective variable is 0. This gives the confidence interval of the odds (fifth confidence interval) for the case where the chest pain explanatory variable is 1 as 1.27 to 1.4. These values are rounded to two decimal places for readability; the calculations below use higher-precision values.
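The arithmetic of the second, third, and fifth confidence intervals can be traced with a short sketch. The function name `odds_interval` is hypothetical; the inputs are the chest pain figures quoted above.

```python
def odds_interval(n_k, p_lower, p_upper):
    """Turn a proportion interval into an odds interval.

    n_k is the number of samples whose explanatory variable is 1.
    """
    # Second interval: frequency with explanatory = 1 and objective = 1.
    freq1_lower = n_k * p_lower
    freq1_upper = n_k * p_upper
    # Third interval: frequency with explanatory = 1 and objective = 0.
    # The lower limit of one frequency pairs with the upper limit of the other.
    freq0_upper = n_k - freq1_lower
    freq0_lower = n_k - freq1_upper
    # Fifth interval: odds when the explanatory variable is 1.
    return freq1_lower / freq0_upper, freq1_upper / freq0_lower


odds_lower, odds_upper = odds_interval(7000, 0.5597365, 0.5830611)
print(round(odds_lower, 2), round(odds_upper, 2))  # 1.27 1.4
```

Because the sample count cancels in the division, each odds limit equals p / (1 - p) for the matching proportion limit.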

Next, the coefficients corresponding to the lower and upper limits of the odds for the case where the explanatory variable is 1 are obtained, which yields the feature amount used as the criterion for variable selection. More specifically, the odds for all samples, the total number of samples, the lower and upper limits of the odds for the case where the explanatory variable is 1, and the number of samples in which the explanatory variable is 1 are substituted into Equation 2 to calculate the confidence interval of the coefficient (fourth or sixth confidence interval). For the chest pain explanatory variable, for example, the lower limit of the odds is 1.27 and the upper limit is 1.4, so the confidence interval of the coefficient has a lower limit of 0.21 and an upper limit of 0.64.
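Equation 2 itself is not reproduced in this passage. The sketch below assumes the form coefficient = ln( R_k (N - N_k) / (N R - N_k R_k) ), where N is the total number of samples, R the odds for all samples, N_k the number of samples in which the explanatory variable is 1, and R_k the odds in that case. This form is an assumption, chosen because it has the denominator N R - N_k R_k mentioned in the description and reproduces the quoted chest pain values. The function name and the handling of a non-positive numerator are likewise illustrative.

```python
import math


def coefficient(total_odds, n_total, odds_k, n_k):
    """Coefficient from odds, under the ASSUMED form of Equation 2."""
    numerator = odds_k * (n_total - n_k)
    denominator = n_total * total_odds - n_k * odds_k
    if denominator <= 0:
        # The description treats this case as positive infinity.
        return math.inf
    if numerator <= 0:
        # Not covered by the description; assumed here for symmetry.
        return -math.inf
    return math.log(numerator / denominator)


# Chest pain: odds limits taken from the frequency limits in the text.
coef_lower = coefficient(1.2, 10000, 3918.1555 / 3081.8445, 7000)
coef_upper = coefficient(1.2, 10000, 4081.4277 / 2918.5723, 7000)
coef_point = coefficient(1.2, 10000, 4000 / 3000, 7000)
print(round(coef_lower, 2), round(coef_upper, 2), round(coef_point, 2))
```

The point estimate obtained from the observed odds 4,000/3,000 rounds to 0.41, which is the chest pain coefficient used later when the model is built.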

The feature amount used as the criterion for variable selection is the minimum absolute value of the coefficient within the confidence interval of the coefficient (fourth or sixth confidence interval). For the chest pain explanatory variable, for example, the confidence interval of the coefficient is 0.21 to 0.64, in which the minimum absolute value is 0.21, so the feature amount is 0.21. For the stomach pain explanatory variable, the confidence interval of the coefficient is -0.82 to -0.55, so the feature amount is the minimum absolute value, 0.55. For the headache explanatory variable, calculating the coefficient from the upper limit of the confidence interval makes NR - N_k R_k, the denominator inside the logarithm of Equation 2, negative, so a valid value cannot be obtained. In such a case the coefficient is treated as positive infinity for convenience. The confidence interval of the coefficient therefore ranges from 2.37 to infinity, and the feature amount is 2.37. For the fatigue explanatory variable, the coefficient ranges from -0.15 to 0.07; this range contains 0, so the feature amount is 0. Taking the confidence interval into account in this way prevents the coefficient of a low-frequency explanatory variable from being calculated as an extremely large or extremely small value by chance.
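The rule for deriving the feature amount from a coefficient interval can be written compactly. The following sketch uses the four intervals quoted above; the function name `feature_amount` is hypothetical.

```python
import math


def feature_amount(coef_lower, coef_upper):
    """Minimum absolute value of the coefficient inside its interval."""
    if coef_lower <= 0.0 <= coef_upper:
        # The interval contains 0, so the smallest absolute value is 0.
        return 0.0
    return min(abs(coef_lower), abs(coef_upper))


# Coefficient intervals as quoted in the description.
intervals = {
    "chest pain": (0.21, 0.64),
    "stomach pain": (-0.82, -0.55),
    "headache": (2.37, math.inf),  # upper limit treated as +infinity
    "fatigue": (-0.15, 0.07),
}
features = {name: feature_amount(lo, hi) for name, (lo, hi) in intervals.items()}
print(features)
```

An interval that straddles zero yields a feature amount of 0, so a variable whose effect cannot be distinguished from zero at the chosen confidence level falls to the bottom of the ranking.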

Next, the variable selection unit 107 and the first model construction unit 108 are described. The flow of processing performed by the variable selection unit 107 and the first model construction unit 108 is shown in the flowchart of FIG. 8.

The variable selection unit 107 selects a predetermined number (one in this example) of explanatory variables to be adopted in the model, based on the feature amount of each explanatory variable calculated by the feature amount calculation unit 106 (S21). When the minimum absolute value of the coefficient within the confidence interval is used as the feature amount, the predetermined number of explanatory variables is selected in descending order of feature amount. In the example shown in FIG. 7, the explanatory variables are selected in the priority order headache, stomach pain, chest pain, fatigue. Here the headache explanatory variable is selected. When a different index is used as the feature amount, the predetermined number may instead be selected in ascending order of feature amount.
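The priority order follows directly from sorting by feature amount. A minimal sketch, using the feature amounts stated earlier for the four explanatory variables:

```python
# Feature amounts from the worked example (minimum absolute coefficient).
features = {
    "stomach pain": 0.55,
    "chest pain": 0.21,
    "headache": 2.37,
    "fatigue": 0.0,
}

# Larger feature amount means higher selection priority.
ranking = sorted(features, key=features.get, reverse=True)
print(ranking)  # ['headache', 'stomach pain', 'chest pain', 'fatigue']
```

For an index where smaller values are preferable, dropping `reverse=True` gives the ascending order mentioned above.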

The first model construction unit (variable selection convergence determination unit) 108 constructs a model using the explanatory variable selected by the variable selection unit 107 and the coefficient of that explanatory variable (S22). The model constructed from the headache explanatory variable and its coefficient is shown in FIG. 9 (first model construction). That is, since the coefficient for headache is 3.56 according to FIG. 6, the model shown in FIG. 9 for the first construction is obtained from Equation 1.

The first model construction unit 108 calculates an index for judging the performance of the constructed model (S23). Because the appropriate index and criterion for model performance differ from problem to problem, they are chosen to suit the problem. If the overall performance of the model matters, an index such as the AUC (area under the curve) of the ROC (receiver operating characteristic) curve is used. If what matters is the proportion of samples whose objective variable is actually 1 among those the model predicted to be 1, an index such as precision is used. If what matters is the proportion of samples the model predicted to be 1 among those whose objective variable is actually 1, an index such as recall is used. If the balance between precision and recall matters, an index such as the F value is used. If fast calculation matters, an index such as the log likelihood is used. Any one of these indices, or a combination of several, is used. In this example the AUC of the ROC curve is used as the index, and an AUC of 0.55 is assumed to have been calculated. A larger AUC indicates a better-performing model.
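For reference, the indices named above can be computed as in the following sketch. The labels and scores are made-up toy values, not data from the embodiment, and the AUC is computed by direct pairwise comparison, which is simple but quadratic in the number of samples.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


def precision_recall_f(labels, predictions):
    """Precision, recall, and F value for 0/1 predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): true objective values and model probabilities.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
precision, recall, f_value = precision_recall_f(
    labels, [1 if s >= 0.5 else 0 for s in scores]
)
print(auc, precision, recall, f_value)
```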

At this point there is no existing model to compare against (YES in S24), so the process returns to step S21 and the second model construction is performed.

The variable selection unit 107 selects the stomach pain explanatory variable, which has the next highest priority after headache (S21).

Since the coefficient for stomach pain is -0.69 according to FIG. 6, the first model construction unit 108 updates the model by adding an explanatory variable with a coefficient of -0.69 to the model built in the first construction (S22). As a result, the model for the second construction is generated, as shown in FIG. 9.

The first model construction unit 108 calculates the performance index value of this model (S23). The AUC of the ROC curve calculated from this model is assumed to be 0.67.

The first model construction unit 108 compares the performance index value of the latest model (the model from the second construction) with that of the previous model (the model from the first construction). Here the comparison is assumed to look back one model only, so only the performance index value of the immediately preceding model is compared with that of the latest model. As shown in FIG. 9, comparing the AUCs of the ROC curves for the first and second constructions shows that the second is larger. This indicates that the latest model performs better than the previous one. The unit therefore determines that there is still room for improvement and that a sufficient model has not yet been constructed (that is, the evaluation result of the model performance does not satisfy the predetermined termination condition) (YES in S24), and returns to step S21 for the third model construction.

The variable selection unit 107 selects the chest pain explanatory variable, which has the next highest priority after stomach pain (S21).

Since the coefficient for chest pain is 0.41 according to FIG. 6, the first model construction unit 108 updates the model by adding an explanatory variable with a coefficient of 0.41 to the model built in the second construction (S22). As a result, the model for the third construction is built, as shown in FIG. 9.

The first model construction unit 108 calculates the performance index value of this model (S23). The AUC of the ROC curve calculated from this model is assumed to be 0.66.

The first model construction unit 108 compares the AUCs of the ROC curves for the second and third constructions. Because the second is larger, the unit determines that a model with sufficient performance had already been constructed at the second construction (that is, the evaluation result of the model performance satisfies the predetermined termination condition) (NO in S24), and ends the process. Of the models constructed up to that point, the second has the highest AUC, so the second model is taken as the final model and is stored in the data storage unit 102 or sent to the output unit 109.
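The loop of steps S21 to S24 can be summarized in the following sketch. It is an illustration under stated assumptions: the function name `forward_select` is hypothetical, the evaluation function simply returns the AUC values assumed in the walkthrough (0.55, 0.67, 0.66) instead of evaluating a real model, a tie is treated as "no improvement", and fatigue is omitted because its coefficient is not stated in this passage.

```python
def forward_select(ranked, coefficients, evaluate):
    """Add variables in priority order until the index stops improving."""
    model = {}
    built = []  # (model snapshot, index value) for every construction
    for name in ranked:                      # S21: next variable by priority
        model[name] = coefficients[name]     # S22: add it with its coefficient
        score = evaluate(model)              # S23: performance index
        built.append((dict(model), score))
        # S24: compare with the immediately preceding model only.
        if len(built) >= 2 and score <= built[-2][1]:
            break
    # The best model seen so far is the final result.
    return max(built, key=lambda item: item[1])


coefficients = {"headache": 3.56, "stomach pain": -0.69, "chest pain": 0.41}
ranked = ["headache", "stomach pain", "chest pain"]

# Stand-in for a real evaluation: AUC values assumed in the description.
assumed_auc = {1: 0.55, 2: 0.67, 3: 0.66}
final_model, final_auc = forward_select(
    ranked, coefficients, lambda m: assumed_auc[len(m)]
)
print(final_model, final_auc)
```

The loop stops at the third construction and returns the second model, matching the outcome described above.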

The output unit 109 outputs the model (the selected explanatory variables and their corresponding coefficients) sent from the first model construction unit 108. The output unit 109 can be chosen to suit the application, and may be a computer display, a printer, or a portable terminal device. As an output method, the output unit 109 may output a plurality of models at once, output them one after another, or output them sorted according to a property of the models such as accuracy. The output unit 109 can also output to a storage unit or a storage medium.

FIG. 10 is a flowchart showing the overall flow of processing performed by the variable selection device of FIG. 1. The processing of each step shown in this flowchart may be realized by causing a computer to execute a program.

First, the data used for model construction is prepared (S11). If the data used for model construction does not exist in the data storage unit 102, the data is input from the input unit 101.

Next, the data used for model construction is checked for data that is not a binary variable (S12). If such data is included (YES in S12), the variable conversion unit 103 converts it into binary variables (S13).

Next, the frequency counting unit 104 calculates the frequencies and the odds (S14). That is, it counts the frequency at which the objective variable is 0 and the frequency at which the objective variable is 1, and calculates the odds for all samples from these counts. It also counts, for each explanatory variable, the frequency at which the objective variable is 0 when the explanatory variable is 1 and the frequency at which the objective variable is 1 when the explanatory variable is 1, and calculates from these counts the odds for the case where the explanatory variable is 1.
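Step S14 amounts to a single pass over the binary data. The sketch below uses a made-up six-sample data set; the function name, the record layout (one dict per sample), and the choice to report infinite odds when a denominator count is zero are assumptions for illustration.

```python
import math


def count_odds(samples, target, explanatory):
    """Count frequencies in one pass and derive the odds from them."""
    total = {0: 0, 1: 0}                           # objective = 0 / 1
    joint = {x: {0: 0, 1: 0} for x in explanatory}  # counted when x = 1
    for row in samples:
        y = row[target]
        total[y] += 1
        for x in explanatory:
            if row[x] == 1:
                joint[x][y] += 1

    def odds(counts):
        # Assumption: report infinity when no objective = 0 sample exists.
        return counts[1] / counts[0] if counts[0] else math.inf

    return odds(total), {x: odds(c) for x, c in joint.items()}


# Toy binary data (hypothetical).
samples = [
    {"headache": 1, "fatigue": 0, "risk": 1},
    {"headache": 1, "fatigue": 1, "risk": 1},
    {"headache": 0, "fatigue": 1, "risk": 0},
    {"headache": 1, "fatigue": 1, "risk": 0},
    {"headache": 0, "fatigue": 0, "risk": 1},
    {"headache": 0, "fatigue": 1, "risk": 0},
]
total_odds, variable_odds = count_odds(samples, "risk", ["headache", "fatigue"])
print(total_odds, variable_odds)
```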

Next, the coefficient estimation unit 105 estimates the coefficient of each explanatory variable (S15).

Next, the feature amount calculation unit 106 calculates, for each explanatory variable, the feature amount used as the criterion for variable selection from the confidence interval of that variable's coefficient (S16).

Next, the variable selection unit 107 selects explanatory variables according to the feature amount of each explanatory variable (S17).

Next, the first model construction unit 108 constructs a model from the explanatory variable selected by the variable selection unit 107 and the coefficient of that explanatory variable, and evaluates the performance of the constructed model (S18). If the model does not have high performance (that is, the evaluation result of the model performance does not satisfy the predetermined termination condition) (NO in S18), the process returns to step S17 and variable selection is performed again; the first model construction unit 108 then updates the model using the newly selected explanatory variable and its coefficient and evaluates the performance of the model (S18). When a high-performance model has been obtained (that is, the evaluation result of the model performance satisfies the predetermined termination condition) (YES in S18), the first model construction unit 108 ends the process.

As described above, according to this embodiment, the feature amount is calculated using frequencies and odds, which can be obtained in a short time for each explanatory variable and the objective variable, and the explanatory variables used in the model are selected based on the calculated feature amounts. A model can therefore be constructed at high speed even when the number of samples and the number of explanatory variables are large.

FIG. 11 shows a modification of the frequency counting unit of FIG. 1.

The frequency counting unit 304 includes a plurality of variable non-selection frequency counting units 301 and 302, a plurality of variable selection frequency counting units 303 and 304, a variable non-selection frequency aggregation unit 305, a variable selection frequency aggregation unit 306, a variable non-selection odds calculation unit 307, and a variable selection odds calculation unit 308.

The variable non-selection frequency counting unit 301 receives part of the data stored in the data storage unit 102 (for example, a subset of all samples), and the variable non-selection frequency counting unit 302 receives some or all of the remaining data stored in the data storage unit 102. Each counts the frequency at which the objective variable is 1 and the frequency at which it is 0. Similarly, the variable selection frequency counting unit 303 receives part of the data stored in the data storage unit 102, and the variable selection frequency counting unit 304 receives some or all of the remaining data. Each counts the frequency at which the objective variable is 1 when the explanatory variable is 1 and the frequency at which the objective variable is 0 when the explanatory variable is 1.

The variable non-selection frequency aggregation unit 305 adds together the frequencies of objective variable 1 counted separately by the variable non-selection frequency counting units 301 and 302, and likewise adds together the frequencies of objective variable 0. For example, suppose the variable non-selection frequency counting unit 301 counts a frequency of 300 for objective variable 0 and 100 for objective variable 1, and the variable non-selection frequency counting unit 302 counts a frequency of 200 for objective variable 0 and 50 for objective variable 1. The variable non-selection frequency aggregation unit 305 then calculates a frequency of 500 for objective variable 0 and 150 for objective variable 1. Similarly, the variable selection frequency aggregation unit 306 aggregates and adds, item by item, the frequencies counted separately by the variable selection frequency counting units 303 and 304, and calculates for each item the frequency at which the objective variable is 1 when the explanatory variable is 1 and the frequency at which the objective variable is 0 when the explanatory variable is 1.
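The split-count-aggregate structure can be sketched with two workers whose partial counts are summed. The chunk contents are constructed to reproduce the example figures (300 and 100, 200 and 50); the use of a thread pool is only an illustration of the structure, not a claim about how the counting units are implemented.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_objective(chunk):
    """Count objective-variable values in one portion of the samples."""
    return Counter(chunk)


# Two portions of the objective-variable column, one per counting unit.
chunk_301 = [0] * 300 + [1] * 100
chunk_302 = [0] * 200 + [1] * 50

# Each counting unit processes its own portion independently.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_objective, [chunk_301, chunk_302]))

# Aggregation: add the partial frequencies value by value.
total = sum(partial_counts, Counter())
print(total[0], total[1])  # 500 150
```

Because the partial counts are simply added, the same aggregation works unchanged for any number of portions.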

The frequencies calculated by the variable non-selection frequency aggregation unit 305 are sent to the variable non-selection odds calculation unit 307. The frequencies calculated for each explanatory variable by the variable selection frequency aggregation unit 306 are sent to the variable selection odds calculation unit 308. The processing performed by the variable non-selection odds calculation unit 307 and the variable selection odds calculation unit 308 is the same as that of the variable non-selection odds calculation unit 202 and the variable selection odds calculation unit 204 of FIG. 4.

With this configuration, the frequency counting, which takes the most processing time, can be performed in parallel, so the model can be constructed even faster. In FIG. 11 the degree of parallelism is 2, but it can be increased depending on the application and purpose.

If the data storage unit is a database management system capable of holding statistical information on frequencies, and the database storing the data for which the objective variable is 1 is separate from the database storing the data for which the objective variable is 0, the variable non-selection frequency counting unit and the variable selection frequency counting unit can be omitted from the frequency counting unit. In other words, the functions of the variable non-selection frequency counting unit and the variable selection frequency counting unit may be incorporated into the database management system, and the present invention includes this case. This is described in more detail below.

FIG. 12 shows an example in which the variable non-selection frequency counting unit and the variable selection frequency counting unit are omitted from the frequency counting unit, the database storing the data for which the objective variable is 0 is stored in the data storage unit 401, and the database storing the data for which the objective variable is 1 is stored in the data storage unit 402. High-performance database management systems that can handle queries often automatically hold statistical information on the frequency of each item in order to process complex queries efficiently. In that case, even without the variable non-selection frequency counting unit and the variable selection frequency counting unit, the variable non-selection odds calculation unit 404 and the variable selection odds calculation unit 405 can immediately obtain from the data storage units 401 and 402 the frequency information needed to calculate the odds. With this configuration, the frequency counting, which takes the most processing time, can be omitted, so the model can be constructed markedly faster still.

FIG. 13 shows an example of a configuration in which the data storage unit, the variable conversion unit, the frequency counting unit, the coefficient estimation unit, and the feature amount calculation unit of FIG. 1 are parallelized. That is, the processing up to the calculation of the feature amounts is performed in parallel by two processing lines: one consisting of the variable conversion unit 103(1), the frequency counting unit 104(1), the coefficient estimation unit 105(1), and the feature amount calculation unit 106(1), and the other consisting of the variable conversion unit 103(2), the frequency counting unit 104(2), the coefficient estimation unit 105(2), and the feature amount calculation unit 106(2). The data storage units 102(1) and 102(2) store the same data to be analyzed, and each processing line acquires and processes different items from the data storage units 102(1) and 102(2) (the item corresponding to the objective variable, however, is acquired by both). In FIG. 13 the degree of parallelism is 2, but it can be increased depending on the application and purpose. With this configuration, the coefficients and feature amounts can be calculated in parallel, so the model can be constructed even faster.

FIG. 14 shows an example of a configuration in which the processing up to and including the frequency counting unit is performed by a server device and the processing from the coefficient estimation unit onward is performed by client devices. That is, FIG. 14 shows an example in which the variable selection device is made up of a server device and client devices. The variable selection device of the present invention is not limited to a single device as in FIG. 1; it also includes a configuration made up of a plurality of devices as in FIG. 14.

The server device 500 includes an input unit 501, a data storage unit 502, a variable conversion unit 503, and a frequency counting unit 504. Two client devices are provided. The client device 510 includes a coefficient estimation unit 515, a feature amount calculation unit 516, a variable selection unit 517, a first model construction unit 518, and an output unit 519. The client device 520 likewise includes a coefficient estimation unit 525, a feature amount calculation unit 526, a variable selection unit 527, a first model construction unit 528, and an output unit 529. The server device 500 and the client devices 510 and 520 are connected via a network such as a local network or the Internet.

The frequency information and the odds information are small in volume compared with the entire data stored in the data storage unit 502. Therefore, by having the server device perform at least the former of the frequency calculation and the odds calculation (when the server device performs only the frequency calculation, the odds calculation is assumed to be performed by the client device) and having the client devices perform the subsequent processing, the processing load on the server device can be reduced without placing much load on the network. Furthermore, once the frequencies of the explanatory variables and similar information have been stored in the data storage unit 502 of the server device 500 at the request of one client device, the other client devices can use that stored information, so models can be constructed efficiently. Although FIG. 14 shows two client devices, the number can be increased or decreased according to the application and purpose. An example in which the frequency information stored in the data storage unit 502 of the server device 500 is shared by a plurality of client devices is described below. In this example, each client device creates a model with a different objective variable.

For example, consider the case in FIG. 3 where the client device 510 uses "risk value" as the objective variable and the client device 520 uses "sleep time" as the objective variable. In the following, the frequency counting unit 504 of the server device 500 counts frequencies but has no function for calculating odds; that function is provided by the coefficient estimation units of the client devices 510 and 520. Naturally, the frequency counting unit 504 of the server device 500 may instead calculate the odds in response to a request from a client device.

(1) So that the client device 510 can calculate the odds of the explanatory variable "sleep state 1", the server device 500, in response to a request from the client device 510, searches for the identifiers for which "sleep state 1" is 1 and counts them. Among these, it counts how many have a "risk value" of 1 and how many have a "risk value" of 0, and sends these counts to the client device 510. Information indicating the number of identifiers found may also be sent to the client device 510.

(2) So that the client device 520 can calculate the odds of "sleep state 1", the server device 500, in response to a request from the client device 520, reuses the identifiers already found in (1) for which "sleep state 1" is 1. Among these, it counts how many have a "sleep time" of 1 and how many have a "sleep time" of 0, and sends these counts to the client device 520. Information indicating the number of identifiers found in (1) may also be sent to the client device 520.

In (2), part of the result of (1) (the identifiers found for which "sleep state 1" is 1, and their number) can be used as is, so the processing load on the server device 500 is reduced.

In addition, once the client device 510 has calculated the odds of the explanatory variable "sleep time", the number of samples for which "sleep time" is 1 is already known, and subtracting it from the total number of samples gives the number for which "sleep time" is 0. Therefore, when the client device 520 calculates the odds of the objective variable "sleep time", the server device 500 does not need to count again the samples for which "sleep time" is 1. It only needs to obtain the number for which "sleep time" is 0 by subtracting the number for which "sleep time" is 1 from the total number of samples.
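The reuse described in (1) and (2), together with the subtraction from the total number of samples, can be sketched as follows. This is only an illustration under stated assumptions: the class `FrequencyServer`, its method names, the in-memory dictionary standing in for the data storage unit 502, and the sample values are all hypothetical.

```python
class FrequencyServer:
    """Hypothetical stand-in for the frequency counting side of server device 500."""

    def __init__(self, samples):
        self.samples = samples    # identifier -> {item name: 0 or 1}
        self._ids_cache = {}      # item name -> identifiers for which the item is 1
        self.searches = 0         # number of full searches actually performed

    def ids_with_one(self, item):
        # Search once per item; later requests reuse the stored identifiers.
        if item not in self._ids_cache:
            self.searches += 1
            self._ids_cache[item] = {i for i, s in self.samples.items() if s[item] == 1}
        return self._ids_cache[item]

    def counts(self, explanatory, objective):
        """Counts sent to a client for one explanatory/objective pair."""
        ids = self.ids_with_one(explanatory)
        ones = sum(1 for i in ids if self.samples[i][objective] == 1)
        return {"explanatory_1": len(ids), "objective_1": ones, "objective_0": len(ids) - ones}

    def objective_totals(self, objective):
        """Counts of the objective variable itself; the 0-count comes from subtraction."""
        ones = len(self.ids_with_one(objective))
        return {"objective_1": ones, "objective_0": len(self.samples) - ones}

# Illustrative data (invented values)
server = FrequencyServer({
    1: {"sleep_state_1": 1, "risk_value": 1, "sleep_time": 0},
    2: {"sleep_state_1": 1, "risk_value": 0, "sleep_time": 1},
    3: {"sleep_state_1": 0, "risk_value": 1, "sleep_time": 1},
    4: {"sleep_state_1": 1, "risk_value": 1, "sleep_time": 0},
})
for_510 = server.counts("sleep_state_1", "risk_value")  # request (1)
for_520 = server.counts("sleep_state_1", "sleep_time")  # request (2), reuses the search
searches_after_both = server.searches
totals = server.objective_totals("sleep_time")
```

Only the small count dictionaries cross the network in this sketch, which reflects the point that frequency information is much smaller than the underlying data.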

FIG. 15 is a diagram showing the configuration of a variable selection device according to another embodiment of the present invention. This variable selection device includes an input unit 601, a data storage unit 602, a variable conversion unit 603, a frequency counting unit 604, a coefficient estimation unit 605, a feature amount calculation unit 606, a variable selection unit 607, a second model construction unit 608, and an output unit 609. Elements 601 to 606 and 609 have the same functions as the identically named elements in FIG. 1.

The variable selection unit 607 selects a predetermined number of explanatory variables in descending order of feature amount and sends them to the second model construction unit 608. The coefficients of the selected explanatory variables, however, are not sent. In the example shown in FIG. 7, if it has been determined that only three explanatory variables are to be selected, the explanatory variables for headache, stomach pain, and chest pain are sent to the second model construction unit 608.
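Selecting a predetermined number of explanatory variables in descending order of feature amount can be sketched as below. The function name `select_top_k` is hypothetical, and the feature amounts are invented placeholders, not the values of FIG. 7.

```python
def select_top_k(feature_amounts, k):
    """Return the names of the k explanatory variables with the largest
    feature amounts. Only names are returned; coefficients are not passed on."""
    ranked = sorted(feature_amounts, key=feature_amounts.get, reverse=True)
    return ranked[:k]

# Placeholder feature amounts (assumed for illustration only)
feature_amounts = {"headache": 0.9, "stomach pain": 0.7, "chest pain": 0.5, "fatigue": 0.1}
selected = select_top_k(feature_amounts, 3)
```

With these placeholder values the three variables named in the example above are selected and "fatigue" is left out.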

The second model construction unit 608 constructs a model by an arbitrary model construction method, using the explanatory variables selected by the variable selection unit 607 and the objective variable from among the data stored in the data storage unit 602. The model to be constructed may be a statistical model such as a logistic regression model, a rule-based model such as a decision tree, or a probabilistic network model such as a Bayesian network. The constructed model is sent to the output unit 609. With this configuration, the number of explanatory variables used for model construction is reduced, so the model can be constructed at high speed.

The variable selection device according to the present invention can also be used as a weak hypothesis model generation device in combination with ensemble learning algorithms such as boosting and bagging. Because the present invention makes it possible to generate weak hypothesis models at high speed, high-speed ensemble learning becomes possible.

The functions of the blocks shown in FIG. 1, FIG. 4, and FIGS. 11 to 15 can also be realized by describing them as software and having a computer with a suitable mechanism execute them.

The present embodiment can also be implemented as a program for causing a computer to execute a predetermined procedure, for causing a computer to function as predetermined means, or for causing a computer to realize a predetermined function. In addition, it can be implemented as a computer-readable recording medium on which the program is recorded.

The present invention is not limited to the above embodiments as they stand; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the above embodiments. For example, some constituent elements may be removed from the full set of constituent elements shown in an embodiment. Furthermore, constituent elements from different embodiments may be combined as appropriate.

FIG. 1 is a basic configuration diagram of a variable selection device as an embodiment of the present invention.
FIG. 2 is a diagram showing a set of initial samples in tabular form.
FIG. 3 is a diagram showing an example of converting data that is not a binary variable into binary variables.
FIG. 4 is a diagram showing the configuration of the frequency counting unit.
FIG. 5 is a diagram showing data on the explanatory variables sleep time, sleep state 1, and sleep state 2.
FIG. 6 is a diagram showing data on the explanatory variables stomach pain, chest pain, headache, and fatigue.
FIG. 7 is a diagram showing the feature amounts and other values calculated for each explanatory variable.
FIG. 8 is a flowchart showing the flow of processing by the variable selection unit and the first model construction unit.
FIG. 9 is a diagram showing how a model is constructed.
FIG. 10 is a flowchart showing the overall flow of processing performed by the variable selection device of FIG. 1.
FIG. 11 is a diagram showing a modification of the frequency counting unit of FIG. 1.
FIG. 12 is a diagram showing a configuration in which the variable-non-selected frequency counting unit and the variable-selected frequency counting unit are omitted from the frequency counting unit.
FIG. 13 is a diagram showing a configuration in which some of the blocks of FIG. 1 are parallelized.
FIG. 14 is a diagram showing an example in which the variable selection device is made up of a server device and client devices.
FIG. 15 is a diagram showing the configuration of a variable selection device according to another embodiment of the present invention.

Explanation of symbols

101: input unit
102: data storage unit
103: variable conversion unit
104: frequency counting unit
105: coefficient estimation unit
106: feature amount calculation unit
107: variable selection unit
108: first model construction unit
109: output unit

Claims

A variable selection device that selects explanatory variables used to generate a model for calculating the probability that a predetermined event occurs or does not occur, using a set of samples each including a plurality of explanatory variables each having a first value or a second value and an objective variable that represents whether or not the predetermined event has occurred by the first value or the second value, the device comprising:
a frequency counting unit that counts the frequency of samples in which the objective variable has the first value as a first frequency and the frequency of samples in which the objective variable has the second value as a second frequency, and that counts, for each explanatory variable, the frequency of samples in which the explanatory variable has the first value and the objective variable has the first value as a third frequency, and the frequency of samples in which the explanatory variable has the first value and the objective variable has the second value as a fourth frequency;
a feature amount calculation unit that calculates a feature amount of each explanatory variable using the first frequency, the second frequency, the third frequency obtained for each explanatory variable, and the fourth frequency obtained for each explanatory variable; and
a variable selection unit that selects one or more explanatory variables based on the calculated feature amounts.

The variable selection device according to claim 1, further comprising a coefficient estimation unit that estimates a coefficient of each explanatory variable using the first frequency, the second frequency, the third frequency obtained for each explanatory variable, and the fourth frequency obtained for each explanatory variable.

The variable selection device according to claim 2, wherein the feature amount calculation unit, for each explanatory variable:
calculates a first confidence interval, with respect to the population, of the ratio of the third frequency or the fourth frequency to the frequency of samples in which the explanatory variable has the first value;
calculates a second confidence interval of the third frequency or the fourth frequency from the calculated first confidence interval and the frequency of samples in which the explanatory variable has the first value;
calculates a third confidence interval of the fourth frequency or the third frequency from the calculated second confidence interval and the frequency of samples in which the explanatory variable has the first value;
calculates a fourth confidence interval of the coefficient of the explanatory variable from the second confidence interval, the third confidence interval, the first frequency, and the second frequency; and
calculates the feature amount of the explanatory variable based on the calculated fourth confidence interval.

The variable selection device according to claim 3, wherein the feature amount calculation unit determines, as the feature amount, the absolute value of the value having the smallest absolute value within the fourth confidence interval.
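As an illustration of the rule that the feature amount is the absolute value of the point in the fourth confidence interval closest to zero, consider the following sketch. The function name and the representation of the interval as a closed pair of bounds are assumptions; how the interval itself is obtained is not shown here.

```python
def feature_from_interval(lower, upper):
    """Smallest absolute value attained on the closed interval [lower, upper].
    If the interval contains 0, the result is 0; otherwise it is the bound
    that lies nearer to 0."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if lower <= 0.0 <= upper:
        return 0.0
    return min(abs(lower), abs(upper))

positive_case = feature_from_interval(0.4, 1.2)    # interval entirely above zero
negative_case = feature_from_interval(-2.0, -0.5)  # interval entirely below zero
straddling = feature_from_interval(-0.3, 0.8)      # interval containing zero
```

Under this rule, a coefficient whose interval straddles zero receives a feature amount of zero, so it ranks below any coefficient whose interval excludes zero.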

The variable selection device according to any one of claims 1 to 4, further comprising a first model generation unit that generates a model using the explanatory variables selected by the variable selection unit and the coefficients estimated for the selected explanatory variables.

The variable selection device according to claim 5, wherein the first model generation unit:
evaluates the performance of the generated model; and
when the result of the evaluation does not satisfy a predetermined termination condition, requests the variable selection unit to select a further explanatory variable.

The variable selection device according to claim 5 or 6, wherein the first model generation unit generates a logistic regression model as the model.

The variable selection device according to any one of claims 1 to 4, further comprising a second model generation unit that generates a model using the selected explanatory variables and the objective variable included in the set of samples.

The variable selection device according to claim 8, wherein the second model generation unit generates a logistic regression model as the model.

The variable selection device according to any one of claims 1 to 9, further comprising:
an input unit that inputs a set of initial samples each consisting of a plurality of explanatory variables and an objective variable that represents whether or not the predetermined event has occurred by the first value or the second value; and
a variable conversion unit that converts the values of the explanatory variables included in each initial sample into the first value or the second value according to a conversion rule given in advance.

A variable selection method for selecting explanatory variables used to generate a model for calculating the probability that a predetermined event occurs or does not occur, using a set of samples each including a plurality of explanatory variables each having a first value or a second value and an objective variable that represents whether or not the predetermined event has occurred by the first value or the second value, the method comprising:
counting the frequency of samples in which the objective variable has the first value as a first frequency and the frequency of samples in which the objective variable has the second value as a second frequency;
counting, for each explanatory variable, the frequency of samples in which the explanatory variable has the first value and the objective variable has the first value as a third frequency, and the frequency of samples in which the explanatory variable has the first value and the objective variable has the second value as a fourth frequency;
calculating a feature amount of each explanatory variable using the first frequency, the second frequency, the third frequency obtained for each explanatory variable, and the fourth frequency obtained for each explanatory variable; and
selecting one or more explanatory variables based on the calculated feature amounts.

A program for selecting explanatory variables used to generate a model for calculating the probability that a predetermined event occurs or does not occur, using a set of samples each including a plurality of explanatory variables each having a first value or a second value and an objective variable that represents whether or not the predetermined event has occurred by the first value or the second value, the program causing a computer to execute:
a step of counting the frequency of samples in which the objective variable has the first value as a first frequency and the frequency of samples in which the objective variable has the second value as a second frequency;
a step of counting, for each explanatory variable, the frequency of samples in which the explanatory variable has the first value and the objective variable has the first value as a third frequency, and the frequency of samples in which the explanatory variable has the first value and the objective variable has the second value as a fourth frequency;
a step of calculating a feature amount of each explanatory variable using the first frequency, the second frequency, the third frequency obtained for each explanatory variable, and the fourth frequency obtained for each explanatory variable; and
a step of selecting one or more explanatory variables based on the calculated feature amounts.