JP7499597B2

JP7499597B2 - Learning model construction system and method

Info

Publication number: JP7499597B2
Application number: JP2020073327A
Authority: JP
Inventors: 敬大濱本; 剛田中; 大輔田代
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2020-04-16
Filing date: 2020-04-16
Publication date: 2024-06-14
Anticipated expiration: 2040-04-16
Also published as: JP2021170244A

Description

The present invention relates to a technology for constructing a machine learning model (including a predictive model) using multiple data sets with different characteristics, such as data collected from multiple institutions. In particular, it relates to a technology for correcting differences in the data characteristics of each data set when constructing a machine learning model.

In recent years, progress has been made in building systems that support human decision-making in a variety of fields, including the public sector, finance, medicine, and marketing, by applying machine learning to historical performance data. Data analysts carefully examine the performance data and apply appropriate treatments and transformations to each explanatory variable to create variables that are more useful for predicting the objective variable, making it possible to create highly accurate predictive models.

On the other hand, because individual businesses lack personnel with data analysis skills and lack data to analyze, consortium-type systems that aggregate data from multiple institutions to create a single integrated machine learning model are becoming more common. Each business can then use predictive models built from diverse data sets, including data that the business itself does not hold, and can develop services that use machine learning at low cost.

Regarding methods for collecting and analyzing data from multiple sources, Patent Document 1 describes a method for "pre-processing industrial data received from various sources with different formats and recording frequencies, as well as optimizing performance indicators for monitoring the performance of manufacturing and process plants." Here, pre-processing refers to unifying the format by removing outliers, filling in missing values, and the like.

Patent Document 1: JP 2018-195308 A

Data collected from multiple institutions to build a single integrated machine learning model generally differs in distribution depending on each institution's customer base, region, business policy, and so on. In other words, the data characteristics may differ between the collected data sets. Differences in data distribution may also arise artificially from differences in data collection and recording methods. Such differences are not limited to data from multiple institutions and may exist among any multiple collected data sets. In other words, the collected data may be diverse.

Building a machine learning model from such diverse data enables more accurate analysis.

However, differences in data distribution among the collected data complicate the relationship between the explanatory variables and the objective variable and reduce the predictive accuracy of the integrated machine learning model. They also increase the workload of data analysts, who must identify and analyze the differences. If a separate machine learning model is instead built from each collected data set, management becomes burdensome, for example because each institution must retrain its own model.

Patent Document 1 has been proposed as a technique for analyzing multiple data sets. In Patent Document 1, when data is received from multiple sources and analyzed, pre-processing is performed according to a uniform process.

However, Patent Document 1 does not correct the data, which makes it difficult to address differences in data distribution. In other words, the differences remain when a machine learning model is constructed by simply merging multiple data sets whose distributions differ. Here, correction means reducing the differences in data distribution between institutions by preprocessing the data of each institution individually and then combining it. Individual preprocessing means performing preprocessing institution by institution. Preprocessing, in turn, covers not only the format-unifying pre-processing of Patent Document 1 but also transforming the numerical values of the explanatory variables themselves.

An object of the present invention is to enable the construction of a machine learning model that takes diversity into account, and thereby more accurate analysis, when a machine learning model is created from multiple data sets with differing data distributions, that is, from diverse data.

In order to solve the above problem, the present invention employs a learning model construction system for constructing a learning model for data analysis. The system has a memory unit that stores multi-institution data having differences in data distribution, a storage unit that stores preprocessing conditions for data preprocessing of the multi-institution data, and an individual preprocessing execution unit. The individual preprocessing execution unit calculates, using the preprocessing conditions, the parameters required for data preprocessing of each of the multi-institution data; executes, for each explanatory variable in the data analysis and on each of the multi-institution data, each of a plurality of first data preprocessings that use the respective parameters; combines, for each data preprocessing, the data on which that data preprocessing has been executed and calculates an index value indicating usefulness for prediction, which is the result of the data analysis; specifies, based on the index value and for each explanatory variable, a predetermined data preprocessing from among the data preprocessings for each of the multi-institution data; and executes, on each of the multi-institution data, the second data preprocessing specified for each explanatory variable.

Furthermore, the present invention also includes a method using the learning model construction system, and a computer program product that causes a computer to function as the learning model construction system.

The present invention further includes a learning device and a prediction device that use a learning model constructed by the learning model construction system. It also includes methods using these learning or prediction devices, and computer program products that cause a computer to function as each of these devices.

According to the present invention, it becomes possible to build a machine learning model that accommodates differences in data distribution.

FIG. 1 is a system configuration diagram of the computer system in the first embodiment.
FIG. 2 is a sequence diagram showing the overall process flow during machine learning model learning in the first embodiment.
FIG. 3 is a sequence diagram showing the overall process flow during prediction with the machine learning model in the first embodiment.
FIG. 4 is a process flow diagram for acquiring learning data during learning in the first embodiment.
FIG. 5 is a process flow diagram for acquiring data correction conditions during learning in the first embodiment.
FIG. 6 is a process flow diagram for performing individual preprocessing according to the data correction conditions during learning in the first embodiment.
FIG. 7 is a process flow diagram for aggregating data after individual preprocessing during learning in the first embodiment.
FIG. 8 is a process flow diagram for training the machine learning model using the corrected data in the first embodiment.
FIG. 9 is a process flow diagram for acquiring prediction data during prediction in the first embodiment.
FIG. 10 is a process flow diagram for acquiring the stored parameter group during prediction in the first embodiment.
FIG. 11 is a process flow diagram of individual preprocessing during prediction in the first embodiment.
FIG. 12 is a process flow diagram for making predictions using the corrected data and the trained model during prediction in the first embodiment.
FIG. 13 shows an example of data with institution information that is input during learning and prediction in the first embodiment.
FIG. 14 shows an example of a data correction condition file in the first embodiment.
FIG. 15 is a schematic diagram showing an example of an input/output screen for data correction results during learning in the first embodiment.
FIG. 16 is a schematic diagram showing an example of an input/output screen for data correction results during prediction in the first embodiment.
FIG. 17 is a schematic diagram showing an example of an input/output screen for data correction results during prediction in the second embodiment.

Modes for carrying out the present invention are described in detail below.

(System configuration)
First, a first embodiment will be described. FIG. 1 is a diagram showing a computer system 1 in the first embodiment. The computer system 1 is composed of a learning server 11, a prediction server 12, a learning client machine 13, and a prediction client machine 14, which are connected via a network 10 so that they can communicate with one another. These servers and client machines may be configured on a single device or on appropriately separated devices, and the areas they may access on one another may be set as appropriate for the application. The learning server 11, the prediction server 12, and the learning client machine 13 may also be installed in a so-called data center. In that case, the learning results and prediction results can be used on the prediction client machine 14 as a cloud service. In such a configuration, it is preferable to use a local area network as the network 10 connecting the learning server 11, the prediction server 12, and the learning client machine 13, and a wide area network such as the Internet as the network 10 connecting these to the prediction client machine 14. Furthermore, since the prediction client machine 14 is the recipient of the service, it is desirable that multiple prediction client machines 14 be provided, one or more per institution, although this is not shown in FIG. 1.

Next, each device of the computer system 1 will be described. The learning server 11 accepts and corrects data when a machine learning model is trained. A general-purpose server device is used for the learning server 11, and it has a CPU 1110, a memory 1120, an external storage device 1130, and a network I/F 1140. Note that this embodiment uses expressions such as "during learning" and "during prediction"; these simply mean when learning or prediction is performed and are not meant in a limiting sense.

The memory 1120 has a learning unit 1121 and a storage unit 1122, and works in cooperation with the CPU 1110 to correct data, train a machine learning model, and record the process. Here, the learning unit 1121 is preferably implemented as a program. The CPU 1110 executes operations according to each component of the learning unit 1121 loaded into the memory 1120, thereby realizing the functions described below. The learning unit 1121 is composed of a data acquisition unit 11211, a data correction condition acquisition unit 11212, an individual preprocessing execution unit 11213, an all-institution data aggregation unit 11214, a machine learning model learning unit 11215, and an output unit 11216. The learning unit 1121 provides the function of acquiring and correcting data and then training a machine learning model.

The data acquisition unit 11211 acquires the learning data input from the learning client machine 13 described below, checks its format, and temporarily stores it. The data correction condition acquisition unit 11212 acquires the data correction condition file input by the user, records it in the data correction condition storage unit 11221 described below, and temporarily stores it.

The individual preprocessing execution unit 11213 executes individual data preprocessing on the data stored by the data acquisition unit 11211, based on the correction conditions temporarily stored by the data correction condition acquisition unit 11212. It provides a function that, for each explanatory variable, automatically selects the individual preprocessing method that satisfies a predetermined condition from among the candidate methods specified in the correction conditions, and executes it. Satisfying the predetermined condition includes being optimal, where optimal means best matching that condition.
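The per-variable selection described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: it assumes two candidate methods ("none" and per-institution standardization) and assumes the index value is the absolute Pearson correlation between the pooled, preprocessed variable and the objective variable. The text above fixes neither choice, and all function and variable names are hypothetical.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy)

def identity(values):
    """Candidate "none": leave the values unchanged."""
    return list(values)

def standardize(values):
    """Candidate "standardize": zero mean, unit variance within one institution."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s if s else 0.0 for v in values]

# Assumed candidate list; in the text it comes from the data correction conditions.
CANDIDATES = {"none": identity, "standardize": standardize}

def select_method(x_by_institution, y_by_institution):
    """Pick the candidate whose pooled output has the highest index value."""
    scores = {}
    for name, transform in CANDIDATES.items():
        pooled_x, pooled_y = [], []
        for inst, xs in x_by_institution.items():
            pooled_x.extend(transform(xs))  # preprocess each institution separately
            pooled_y.extend(y_by_institution[inst])
        scores[name] = abs(pearson(pooled_x, pooled_y))  # assumed index value
    best = max(scores, key=scores.get)
    return best, scores

# Two institutions record the same variable on very different scales.
x = {"A": [1.0, 2.0, 3.0, 4.0], "B": [101.0, 102.0, 103.0, 104.0]}
y = {"A": [0.0, 1.0, 2.0, 3.0], "B": [0.0, 1.0, 2.0, 3.0]}
best, scores = select_method(x, y)
```

In this toy data the offset between institutions hides the relationship when the raw values are pooled, so the per-institution standardization scores higher and is selected.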

The optimal individual preprocessing method and the parameters used for the individual preprocessing are recorded in the individual preprocessing content storage unit 11223 and the individual preprocessing parameter storage unit 11222, respectively, both described below.

The all-institution data aggregation unit 11214 aggregates the data of each institution that has been individually preprocessed by the individual preprocessing execution unit 11213. The machine learning model learning unit 11215 inputs the corrected all-institution data aggregated by the all-institution data aggregation unit 11214 into a machine learning model, and records the trained model in the trained machine learning model storage unit 11224 described below. The machine learning model used is not limited to a specific algorithm. The output unit 11216 outputs the data correction results and the learning results to the learning client machine 13 described below. Note that "all-institution data" means the data of each of the institutions under consideration and is not limited to "all" in the strict sense. In addition, although this embodiment uses data from individual institutions, the present invention only requires multiple data sets, which need not have been collected by different institutions.

The storage unit 1122 is composed of a data correction condition storage unit 11221, an individual preprocessing parameter storage unit 11222, an individual preprocessing content storage unit 11223, and a trained machine learning model storage unit 11224. It stores the correction conditions, the correction content, and the trained model from the learning performed by the learning unit 1121, so that the same data correction as during learning can be carried out during prediction. During prediction, the stored files are provided, for example via the network 10, in response to requests from the prediction unit 1221 described below. The data correction condition storage unit 11221 stores the file acquired by the data correction condition acquisition unit 11212, which lists the candidate methods to be tried as individual data preprocessing and the index used to judge which individual preprocessing method is better.

The individual preprocessing parameter storage unit 11222 stores the parameters, calculated by the individual preprocessing execution unit 11213, that are required to execute the individual data preprocessing. For example, if individual standardization is among the candidate methods, the mean and standard deviation for each explanatory variable and each institution are stored as individual preprocessing parameters. The individual preprocessing content storage unit 11223 stores, for each explanatory variable, the optimal individual preprocessing method determined by the individual preprocessing execution unit 11213. The trained machine learning model storage unit 11224 stores the machine learning model trained by the machine learning model learning unit 11215.
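As an illustration of the individual standardization parameters mentioned above, the following is a minimal sketch that computes the mean and standard deviation per explanatory variable and per institution. The row layout, the key name `institution`, the use of the population standard deviation, and the JSON serialization (a stand-in for writing to the parameter storage unit) are all assumptions made for this example.

```python
import json
from statistics import mean, pstdev

def compute_standardization_params(records, variables, institution_key="institution"):
    """Return {variable: {institution: {"mean": m, "std": s}}} from a list of row dicts."""
    grouped = {}
    for row in records:
        for var in variables:
            grouped.setdefault(var, {}).setdefault(row[institution_key], []).append(row[var])
    return {
        var: {
            inst: {"mean": mean(vals), "std": pstdev(vals)}
            for inst, vals in by_inst.items()
        }
        for var, by_inst in grouped.items()
    }

# Hypothetical learning data with institution information attached to each row.
records = [
    {"institution": "A", "age": 30.0, "income": 400.0},
    {"institution": "A", "age": 50.0, "income": 600.0},
    {"institution": "B", "age": 20.0, "income": 300.0},
    {"institution": "B", "age": 40.0, "income": 500.0},
]
params = compute_standardization_params(records, ["age", "income"])

# Stand-in for registering the parameters so that prediction can reuse them.
serialized = json.dumps(params, sort_keys=True)
```

Keeping the parameters keyed by both variable and institution is what allows the prediction side to apply exactly the values computed during learning.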

The prediction server 12 accepts data, corrects it, and makes predictions when the machine learning model is used for prediction. A general-purpose server device is used for the prediction server 12, and it has a CPU 1210, a memory 1220, an external storage device 1230, and a network I/F 1240. The memory 1220 has a prediction unit 1221 and, in cooperation with the CPU 1210, outputs predicted values for the input data. Here, the prediction unit 1221 is implemented as a program. The CPU 1210 executes operations according to each component of the prediction unit 1221 loaded into the memory 1220, thereby realizing the functions described below.

The prediction unit 1221 is composed of a data acquisition unit 12211, a stored data acquisition unit 12212, an individual preprocessing execution unit 12213, an all-institution data aggregation unit 12214, a machine learning model prediction unit 12215, and an output unit 12216. The prediction unit 1221 acquires correction parameters from the storage unit 1122, corrects the input prediction target data, and performs prediction with the machine learning model. The data acquisition unit 12211 acquires the prediction target data with institution information input from the prediction client machine 14 described below, checks its format, and temporarily stores it.

The stored data acquisition unit 12212 acquires the individual preprocessing parameters and the trained machine learning model stored in the storage unit 1122 and temporarily stores them. The individual preprocessing execution unit 12213 executes individual preprocessing on the data stored by the data acquisition unit 12211, using the individual preprocessing parameters stored by the stored data acquisition unit 12212. The all-institution data aggregation unit 12214 combines the data of each institution after individual preprocessing. In this way, the same correction as during learning can be applied to the prediction data.

The machine learning model prediction unit 12215 uses the trained machine learning model stored by the stored data acquisition unit 12212 to perform prediction on the prediction data corrected by the individual preprocessing execution unit 12213. The output unit 12216 outputs the data correction results and the prediction results to the prediction client machine 14 described below.

The learning client machine 13 transmits learning data and correction parameters to the learning server 11 during learning, and provides an I/F for the operations that advance learning. A general-purpose computer device is used for the learning client machine 13, and it has a CPU 1310, a memory 1320, an external storage device 1330, and a network I/F 1340.

The memory 1320 has a learning client unit 1321, which cooperates with the CPU 1310 and communicates with the learning server 11 to perform learning. Here, the learning client unit 1321 is implemented as a program. The CPU 1310 executes operations according to each component of the learning client unit 1321 loaded into the memory 1320, thereby realizing the functions described below. The learning client unit 1321 receives the multi-institution data and data correction conditions input through user operations and transmits them to the learning server 11. It also receives the results of the learning performed by the learning server 11 on that basis, and provides an I/F that makes it possible to display the results graphically, re-run learning, and so on.

The prediction client machine 14 transmits prediction data with institution information to the prediction server 12 during prediction, and provides an I/F for the operations that carry out prediction. A general-purpose computer device is used for the prediction client machine 14, and it has a CPU 1410, a memory 1420, an external storage device 1430, and a network I/F 1440.

The memory 1420 has a prediction client unit 1421, which cooperates with the CPU 1410 and communicates with the prediction server 12 to perform prediction. Here, the prediction client unit 1421 is implemented as a program. The CPU 1410 executes operations according to each component of the prediction client unit 1421 loaded into the memory 1420, thereby realizing the functions described below.

The prediction client unit 1421 receives the data with institution information input through user operations and transmits it to the prediction server 12. It also receives the results of the prediction performed by the prediction server 12, and provides an I/F that makes it possible to display the results graphically and so on.

(Overall process flow during learning)
Next, the overall process flow during machine learning model learning will be described with reference to FIG. 2. First, the learning client unit 1321 of the learning client machine 13 transmits the multi-institution data and the data correction conditions to the data acquisition unit 11211 and the data correction condition acquisition unit 11212 of the learning unit 1121 (S101).

Next, the data acquisition unit 11211 temporarily stores the data acquired in S101. In addition, the data correction condition acquisition unit 11212 registers the data correction conditions received in S101 in the data correction condition storage unit 11221 of the storage unit 1122 (S102).

Next, the individual preprocessing execution unit 11213 of the learning unit 1121 calculates the parameters required for individual preprocessing (S103). The individual preprocessing execution unit 11213 then registers the calculated parameters in the individual preprocessing parameter storage unit 11222 of the storage unit 1122 (S104).

The individual preprocessing execution unit 11213 determines the individual preprocessing content based on these parameters and the data correction conditions (S105). It then registers the determined individual preprocessing content in the individual preprocessing content storage unit 11223 (S106), and executes individual preprocessing according to that content (S107).

Next, the all-institution data aggregation unit 11214 combines the data that was individually preprocessed in S107 (S108). The machine learning model learning unit 11215 then inputs the combined data into the machine learning model and performs learning (S109), and registers the model trained in S109 in the trained machine learning model storage unit 11224 of the storage unit 1122 (S110).
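Steps S103 to S110 can be sketched end to end as follows. This is a minimal sketch under simplifying assumptions: a single explanatory variable, per-institution standardization fixed as the chosen preprocessing (the selection in S105 is not modeled), ordinary least squares standing in for the unspecified machine learning model, and a plain dictionary standing in for the storage unit 1122. All names are hypothetical.

```python
from statistics import mean, pstdev

def run_learning_flow(data_by_institution):
    """data_by_institution: {institution: [(x, y), ...]} with one explanatory variable x."""
    store = {}  # stand-in for storage unit 1122

    # S103-S104: compute and register the per-institution parameters.
    store["params"] = {
        inst: (mean(x for x, _ in rows), pstdev([x for x, _ in rows]))
        for inst, rows in data_by_institution.items()
    }

    # S105-S106: decide and register the preprocessing content (fixed here for brevity).
    store["method"] = "standardize"

    # S107: apply the individual preprocessing institution by institution.
    # S108: combine the preprocessed rows into one data set.
    combined = []
    for inst, rows in data_by_institution.items():
        m, s = store["params"][inst]
        combined.extend(((x - m) / s if s else 0.0, y) for x, y in rows)

    # S109: fit y = slope * x + intercept by ordinary least squares (assumed model).
    xs = [x for x, _ in combined]
    ys = [y for _, y in combined]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx

    # S110: register the trained model.
    store["model"] = {"slope": slope, "intercept": intercept}
    return store

# Institution B records x on a scale 100 times larger than institution A.
store = run_learning_flow({
    "A": [(1.0, 10.0), (3.0, 30.0)],
    "B": [(100.0, 10.0), (300.0, 30.0)],
})
```

After standardization both institutions contribute the same corrected values, so a single model fits the combined data exactly in this toy case.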

In response, the output unit 11216 outputs the correction results identified by the above processing to the learning client unit 1321 (S111). As a result, the learning client machine 13 can display the results on its display device. In this way, the user can train the machine learning model after appropriately correcting the data collected from multiple institutions.

(Overall process flow during prediction)
Next, the overall process flow during machine learning model prediction will be described with reference to FIG. 3. First, the prediction client unit 1421 of the prediction client machine 14 transmits prediction target data with institution information to the data acquisition unit 12211 on the prediction server 12 (S201).

Next, the data acquisition unit 12211 temporarily stores the data acquired in S201. In addition, the stored data acquisition unit 12212 on the prediction server 12 sends the storage unit 1122 on the learning server 11 a request to transmit the individual preprocessing parameters, the individual preprocessing content, and the trained machine learning model (S202).

Next, in response to the request from the stored data acquisition unit 12212, the storage unit 1122 outputs these files to the stored data acquisition unit 12212 (S203). The stored data acquisition unit 12212 temporarily stores the files.

次に、個別前処理実行部12213は、格納データ取得部12212で一時記憶された個別前処理内容を用いて、データ取得部12211で一時記憶されたデータを対象に、個別前処理を実行する(S204)。 Next, the individual pre-processing execution unit 12213 executes individual pre-processing on the data temporarily stored in the data acquisition unit 12211 using the individual pre-processing content temporarily stored in the stored data acquisition unit 12212 (S204).

次に、全機関データ集計部12214は、S204で個別前処理されたデータを結合する(S205)。以上によって、学習時に決定されたものと同一のデータ補正を入力された予測対象データに施すことができる。 Next, the all-institution data aggregation unit 12214 combines the data that was individually preprocessed in S204 (S205). As a result, the same data correction that was determined during learning can be applied to the input prediction target data.

また、機械学習モデル予測部12215では、格納データ取得部12212で取得した学習済み機械学習モデルを用いて結合後の予測対象データを対象に予測値を算出する(S206)。これを受けて、出力部12226は、補正結果および予測結果を予測用クライアント部1421に出力する(S207)。このことで、予測用クライアントマシン14では、表示装置に補正結果および予測結果を表示することができる。以上によってユーザーは機関情報の付与された予測対象データに対して、学習時と同一の補正を施し、学習済み機械学習モデルによる予測結果を得ることができる。
（学習時の各詳細処理フロー）
以下、学習時の詳細フローを説明する。まず、図4を用いて学習時に学習データを取得する際のデータ取得部11211における処理フローを詳細に説明する。つまり、図3のS102の詳細を説明する。まず、データ取得部11211は、学習用クライアントマシン13から所属機関情報が付与された学習用データを取得する(S301)。次に、データ取得部11211は、取得したデータのフォーマット等をチェックし、後続の処理に問題のある場合は適切な警告を学習用クライアントマシン13に送信する(S302)。例えば機関ごとにデータの形式が統一されているかどうかなどがチェックの対象となるが、これだけに限定されるものではない。最後に、データ取得部11211は、受け付けたデータを一時的に記憶しておく(S303)。この一時的に記憶されたデータを用いる個別前処理実行部11213の処理は、図6を用いて後述する。 Furthermore, the machine learning model prediction unit 12215 calculates a predicted value for the prediction target data after the combination using the trained machine learning model acquired by the stored data acquisition unit 12212 (S206). In response to this, the output unit 12226 outputs the correction result and the prediction result to the prediction client unit 1421 (S207). This allows the prediction client machine 14 to display the correction result and the prediction result on the display device. In this way, the user can apply the same correction as during learning to the prediction target data to which the institution information has been added, and obtain the prediction result using the trained machine learning model.
(Detailed process flow during learning)
A detailed flow during learning will be described below. First, the process flow in the data acquisition unit 11211 when acquiring learning data during learning will be described in detail with reference to FIG. 4. That is, the details of S102 in FIG. 3 will be described. First, the data acquisition unit 11211 acquires learning data to which affiliation information is attached from the learning client machine 13 (S301). Next, the data acquisition unit 11211 checks the format of the acquired data, and if there is a problem with the subsequent processing, transmits an appropriate warning to the learning client machine 13 (S302). For example, the data acquisition unit 11211 checks whether the data format is unified for each institution, but is not limited to this. Finally, the data acquisition unit 11211 temporarily stores the received data (S303). The processing of the individual pre-processing execution unit 11213 using this temporarily stored data will be described later with reference to FIG. 6.

次に、図5を用いて学習時の補正条件を取得する際のデータ補正条件取得部11212における処理フローを詳細に説明する。つまり、図3のS102の詳細を説明する。 Next, the process flow of the data correction condition acquisition unit 11212 when acquiring the correction conditions during learning will be described in detail with reference to FIG. 5. In other words, the details of S102 in FIG. 3 will be described.

まず、データ補正条件取得部11212は、学習用クライアントマシン13からデータ補正条件ファイルを取得する(S401)。次に、データ補正条件取得部11212は、取得したデータをデータ補正条件格納部11221に記録し(S402)、データ補正条件を一時的に記憶しておく(S403)。このデータ補正条件を用いる個別前処理実行部11213の処理は、図6を用いて次に説明する。 First, the data correction condition acquisition unit 11212 acquires a data correction condition file from the learning client machine 13 (S401). Next, the data correction condition acquisition unit 11212 records the acquired data in the data correction condition storage unit 11221 (S402) and temporarily stores the data correction conditions (S403). The processing of the individual pre-processing execution unit 11213 that uses these data correction conditions will be explained next with reference to FIG. 6.

次に、図6を用いて学習時における個別前処理実行部11213における処理フローを詳細に説明する。 Next, the processing flow of the individual preprocessing execution unit 11213 during learning will be explained in detail using Figure 6.

まず、個別前処理実行部11213は、データ取得部11211で一時記憶した所属機関情報付き学習用データを対象にパラメータを算出する(S501)。ここで、パラメータとは、データ補正条件取得部11212で一時記憶したデータ補正条件中に記載された試行対象となる個別前処理に必要なパラメータである。ここで、上記学習用データは、S303で記憶されたものである。また、上記データ補正条件は、S403で記憶されたものである。なお、個別前処理実行部11213は、算出されたパラメータを、個別前処理用パラメータとして一時記憶しておく。この際、個別前処理実行部11213は、試行対象となる個別前処理として、機関別の平均値と標準偏差を用いた個別標準化や、機関別の最小値と最大値を用いた個別正規化を用いることが可能である。 First, the individual preprocessing execution unit 11213 calculates parameters for the learning data with affiliated institution information temporarily stored in the data acquisition unit 11211 (S501). Here, the parameters are parameters required for the individual preprocessing to be trialed, which are described in the data correction conditions temporarily stored in the data correction condition acquisition unit 11212. Here, the learning data is stored in S303. The data correction conditions are stored in S403. The individual preprocessing execution unit 11213 temporarily stores the calculated parameters as parameters for individual preprocessing. At this time, the individual preprocessing execution unit 11213 can use individual standardization using the average value and standard deviation for each institution, or individual normalization using the minimum and maximum values for each institution, as the individual preprocessing to be trialed.
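
The parameter calculation of S501 can be illustrated with a minimal Python sketch. It assumes the learning data is a list of dictionaries carrying an institution label; the function name `compute_params`, the key names, and the use of the population standard deviation are assumptions made for illustration and are not specified in the description.

```python
from collections import defaultdict
from statistics import mean, pstdev


def compute_params(rows, variables, institution_key="institution"):
    """Return {institution: {variable: {"mean", "std", "min", "max"}}}."""
    # Group the records by the institution they belong to.
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[institution_key]].append(row)

    params = {}
    for inst, inst_rows in grouped.items():
        params[inst] = {}
        for var in variables:
            values = [r[var] for r in inst_rows]
            params[inst][var] = {
                "mean": mean(values),   # used by standardization and centering
                "std": pstdev(values),  # population std; an assumption
                "min": min(values),     # used by normalization
                "max": max(values),
            }
    return params


# Hypothetical toy data: annual income in million yen for two institutions.
rows = [
    {"institution": "A", "income": 3.0},
    {"institution": "A", "income": 5.0},
    {"institution": "B", "income": 6.0},
    {"institution": "B", "income": 10.0},
]
params = compute_params(rows, ["income"])
```

Computing all four statistics up front means any of the candidate methods can later be tried, or applied at prediction time, without revisiting the raw data.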

次に、個別前処理実行部11213は、一時記憶された個別前処理用パラメータを個別前処理用パラメータ格納部11222に記録する(S502)。 Next, the individual pre-processing execution unit 11213 records the temporarily stored individual pre-processing parameters in the individual pre-processing parameter storage unit 11222 (S502).

そして、個別前処理実行部11213は、以下の処理S504、S505を個別前処理の対象となる説明変数すべてに対して処理が完了するまで行う(S503)。 Then, the individual preprocessing execution unit 11213 performs the following processes S504 and S505 until the processes are completed for all explanatory variables subject to individual preprocessing (S503).

個別前処理実行部11213は、未処理変数から一つの未処理変数を選択し、S501で先述した試行対象となる個別前処理を行う個別前処理手法のそれぞれを仮実行する(S504)。なお、仮実行とは、後述するS506での個別前処理との区別のための名称であり、個別前処理の内容はこれと同等である。この際、個別前処理実行部11213は、S501で算出、記憶された個別前処理用パラメータを使用する。 The individual preprocessing execution unit 11213 selects one unprocessed variable from the unprocessed variables, and provisionally executes each of the individual preprocessing methods that perform the individual preprocessing to be tried in S501 (S504). Note that the term "provisional execution" is a name used to distinguish it from the individual preprocessing in S506, which will be described later, and the content of the individual preprocessing is the same. At this time, the individual preprocessing execution unit 11213 uses the individual preprocessing parameters calculated and stored in S501.

次に、個別前処理実行部11213は、データ補正条件において指定された指標値を各試行対象個別前処理手法に対して計算する。そして、個別前処理実行部11213は、指標値の最大値を与える個別前処理手法を個別前処理内容格納部11223に一時記憶する(S505)。なお、個別前処理内容格納部11223に格納される個別前処理手法は、これを特定できる情報であればよい。ここで、演算内容や個別前処理済のデータ（仮実行された結果）は記憶されないか、都度削除される。つまり、S504、S505において、個別前処理実行部11213は、個別前処理手法のうちその際指標値が最大であるものを特定する情報を都度記憶し、演算内容等は削除される等して、記憶容量を効率的に利用する。なお、ここで記録される個別前処理手法（特定する情報）は、図14の最適個別前処理3302として実装される。 Next, the individual preprocessing execution unit 11213 calculates the index value specified in the data correction conditions for each individual preprocessing method to be tried. Then, the individual preprocessing execution unit 11213 temporarily stores the individual preprocessing method that gives the maximum index value in the individual preprocessing content storage unit 11223 (S505). Note that the individual preprocessing method stored in the individual preprocessing content storage unit 11223 may be information that can identify it. Here, the calculation content and the individually preprocessed data (provisionally executed results) are not stored or are deleted each time. In other words, in S504 and S505, the individual preprocessing execution unit 11213 stores information that identifies the individual preprocessing method with the maximum index value at that time each time, and deletes the calculation content, etc., to efficiently use the storage capacity. Note that the individual preprocessing method recorded here (information to identify it) is implemented as the optimal individual preprocessing 3302 in FIG. 14.

なお、このS505においては、個別前処理実行部11213は、個別前処理手法そのものに対する評価を実行していることになる。 In addition, in S505, the individual preprocessing execution unit 11213 is performing an evaluation of the individual preprocessing method itself.

次に、個別前処理実行部11213は、予め設定された各説明変数に対し処理S504、S505が終了したかを判断する(S503)。この結果、終了していない場合、S504に戻る。また、終了していたら、個別前処理実行部11213は、各説明変数に対し、決定された最適な個別前処理を実施する(S507)。本実施例では、個別前処理実行部11213は、最適な個別前処理として、指標値が最大値を示すものを用いる。つまり、本ステップでは、個別前処理実行部11213は、最適な個別前処理として、仮実行したうち所定の個別前処理を再計算することになる。また、指標値の一例として、相関係数を用いることも可能である。 Next, the individual preprocessing execution unit 11213 determines whether the processes S504 and S505 have been completed for each explanatory variable that has been set in advance (S503). If the processes have not been completed, the process returns to S504. If the processes have been completed, the individual preprocessing execution unit 11213 executes the determined optimal individual preprocessing for each explanatory variable (S507). In this embodiment, the individual preprocessing execution unit 11213 uses the optimal individual preprocessing that has the maximum index value. In other words, in this step, the individual preprocessing execution unit 11213 recalculates a specific individual preprocessing that has been provisionally executed as the optimal individual preprocessing. It is also possible to use a correlation coefficient as an example of the index value.

以上によって、データ補正条件に指定された試行対象個別前処理手法のリストの中で、データ補正条件に指定された指標値を最大化する個別前処理手法が各説明変数に対し実行される。そして、個別前処理実行部11213は、個別前処理後の各機関データを、次に処理を行う全機関データ集計部11214に出力する(S507)。以降の全機関データ集計部11214の処理は、図7を用いて次に説明する。 As a result, from the list of individual preprocessing methods to be tried that are specified in the data correction conditions, the individual preprocessing method that maximizes the index value specified in the data correction conditions is executed for each explanatory variable. Then, the individual preprocessing execution unit 11213 outputs each institution's data after individual preprocessing to the all-institution data aggregation unit 11214, which performs the next process (S507). The subsequent processing of the all-institution data aggregation unit 11214 will be explained next with reference to FIG. 7.
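
The selection loop of S503 to S505 can be sketched as follows for a single explanatory variable, assuming the index value is the Pearson correlation coefficient between the transformed variable and the objective variable. The helper names (`pearson`, `transform`, `select_method`) and the method labels are hypothetical; only the identifier of the best method is kept, and the provisionally transformed data is discarded, in line with the memory consideration described above.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for constant input."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def transform(method, values):
    """Apply one method to the values of a single institution."""
    if method == "none":
        return list(values)
    m = mean(values)
    if method == "center":
        return [v - m for v in values]
    if method == "standardize":
        s = pstdev(values) or 1.0  # guard against constant columns
        return [(v - m) / s for v in values]
    raise ValueError(f"unknown method: {method}")


def select_method(xs, ys, institutions, candidates):
    """Return (method, index value) for the candidate with the largest index value."""
    best_method, best_score = None, float("-inf")
    for method in candidates:
        transformed = [0.0] * len(xs)
        for inst in set(institutions):
            idx = [i for i, g in enumerate(institutions) if g == inst]
            new_values = transform(method, [xs[j] for j in idx])
            for i, v in zip(idx, new_values):
                transformed[i] = v
        score = pearson(transformed, ys)
        if score > best_score:  # keep only the identifier and its score
            best_method, best_score = method, score
    return best_method, best_score


# Hypothetical data: institution B's values are shifted by +10 relative to A.
xs = [1, 2, 3, 11, 12, 13]
ys = [0, 0, 1, 0, 0, 1]
institutions = ["A", "A", "A", "B", "B", "B"]
best, score = select_method(xs, ys, institutions, ["none", "standardize", "center"])
```

In this toy data the shift between institutions masks the relationship with the objective variable, so a per-institution method scores higher than leaving the variable uncorrected.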

なお、S506において、個別前処理実行部11213は、S504の仮実行の結果を流用してもよい。この場合、個別前処理実行部11213は、処理行った各前処理結果を、格納部1122に都度一時記憶する。そして、個別前処理実行部11213は、S503で各説明変数に対し処理S504、S505が終了したと判断した場合、一時記憶された各前処理結果のうち、指標値の最大値を与える個別前処理手法の結果を流用する。なお、S507においては、流用よりも決定された最適な個別前処理を実行する方が、メモリを効率的に利用できる、との利点がある。 In S506, the individual preprocessing execution unit 11213 may reuse the results of the provisional execution in S504. In this case, the individual preprocessing execution unit 11213 temporarily stores the results of each preprocessing performed in the storage unit 1122 each time. Then, when the individual preprocessing execution unit 11213 determines in S503 that the processing in S504 and S505 has been completed for each explanatory variable, it reuses the results of the individual preprocessing method that gives the maximum index value from among the temporarily stored preprocessing results. In S507, there is an advantage in that memory can be used more efficiently by executing the determined optimal individual preprocessing rather than reusing.

次に、図7を用いて、学習時の個別前処理後のデータを集計する際の全機関データ集計部11214における処理フローを詳細に説明する。まず、全機関データ集計部11214は、個別前処理実行部11213から個別前処理後のデータを取得する(S601)。つまり、全機関データ集計部11214は、図6のS507で出力されたデータを取得する。 Next, the processing flow in the all-institution data aggregation unit 11214 when aggregating data after individual preprocessing during learning will be described in detail with reference to FIG. 7. First, the all-institution data aggregation unit 11214 acquires data after individual preprocessing from the individual preprocessing execution unit 11213 (S601). In other words, the all-institution data aggregation unit 11214 acquires the data output in S507 of FIG. 6.

次に、全機関データ集計部11214は、取得した各データを結合する(S602)。次に、全機関データ集計部11214は、必要に応じてデータ整合性のチェックを行う(S603)。整合性のチェックには、例えば、全機関データ集計部11214が、本実施例の目的とする機関ごとのデータ分布の違いがユーザーの意図通りに補正されているか否かを判定することが含まれる。なお、全機関データ集計部11214は、このチェック結果を、学習用クライアントマシン13に通知してもよい。また、S603の処理は、省略してもよい。以上によって、機関毎のデータ分布の違いが補正された学習データセットが生成される。 Next, the all-institution data aggregation unit 11214 combines the acquired data (S602). Next, the all-institution data aggregation unit 11214 checks data consistency as necessary (S603). The consistency check includes, for example, the all-institution data aggregation unit 11214 determining whether the differences in data distribution between institutions, which is the objective of this embodiment, have been corrected as intended by the user. The all-institution data aggregation unit 11214 may notify the learning client machine 13 of the check result. Also, the processing of S603 may be omitted. As a result of the above, a learning dataset in which the differences in data distribution between institutions have been corrected is generated.
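
As a rough illustration of S602 and S603, the sketch below concatenates per-institution records and then checks whether the per-institution means of one variable agree within a tolerance. The description does not specify how the consistency check is performed, so the mean-difference criterion, the tolerance value, and the function names are all assumptions.

```python
from statistics import mean


def combine(per_institution):
    """Concatenate per-institution row lists into one dataset (cf. S602)."""
    combined = []
    for inst, rows in per_institution.items():
        combined.extend({**row, "institution": inst} for row in rows)
    return combined


def means_aligned(combined, variable, tolerance=0.1):
    """True if the per-institution means of `variable` differ by at most `tolerance`."""
    by_inst = {}
    for row in combined:
        by_inst.setdefault(row["institution"], []).append(row[variable])
    means = [mean(values) for values in by_inst.values()]
    return max(means) - min(means) <= tolerance


# Hypothetical data after individual centering: both means are zero.
corrected = combine({
    "A": [{"income": -1.0}, {"income": 1.0}],
    "B": [{"income": -0.5}, {"income": 0.5}],
})
# Hypothetical uncorrected data for comparison.
raw = combine({"A": [{"income": 3.0}], "B": [{"income": 6.0}]})
```

A result of `False` from `means_aligned` would correspond to the case in which a notification is sent to the learning client machine 13.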

そして、全機関データ集計部11214は、結合されたデータを機械学習モデル学習部11215に出力する(S604)。この結合されたデータを用いる機械学習モデル学習部11215の処理は、図8を用いて次に説明する。 Then, the all-institution data aggregation unit 11214 outputs the combined data to the machine learning model learning unit 11215 (S604). The processing of the machine learning model learning unit 11215 using this combined data will be explained next with reference to FIG. 8.

次に、図8を用いて学習時における機械学習モデル学習部11215における処理フローを詳細に説明する。まず、機械学習モデル学習部11215は、全機関データ集計部11214から出力されるデータである集計後のデータを取得する(S701)。 Next, the process flow in the machine learning model learning unit 11215 during learning will be described in detail with reference to FIG. 8. First, the machine learning model learning unit 11215 acquires the aggregated data, which is the data output from the all-institution data aggregation unit 11214 (S701).

次に、機械学習モデル学習部11215は、集計後のデータを機械学習モデルに入力することで学習を行う(S702)。この際、機械学習モデルのアルゴリズムは特定のものに限定されない。 Next, the machine learning model learning unit 11215 performs learning by inputting the aggregated data into the machine learning model (S702). At this time, the algorithm of the machine learning model is not limited to a specific one.

最後に、機械学習モデル学習部11215は、学習済みの機械学習モデルを学習済み機械学習モデル格納部11224に記録する(S703)。 Finally, the machine learning model learning unit 11215 records the trained machine learning model in the trained machine learning model storage unit 11224 (S703).

その後、出力部11216において学習時に行われた補正の結果を、学習用クライアント部1321に出力する。出力の際のI/F、つまり表示装置への表示の例については、図15を用いて後述する。 Then, the output unit 11216 outputs the results of the corrections made during learning to the learning client unit 1321. An example of the I/F used for output, i.e., display on a display device, will be described later with reference to FIG. 15.

以上で、学習時の詳細フローの説明を終了し、以下、予測時の詳細フローを説明する。
（予測時の各詳細処理フロー）
まず、図9を用いて、予測時におけるデータ取得部12211における処理フローを詳細に説明する。まず、データ取得部12211は、予測用クライアントマシン14から所属機関情報が付与された予測対象データを取得する(S801)。 This concludes the detailed explanation of the flow during learning. Below, we will explain the detailed flow during prediction.
(Detailed process flow during prediction)
First, the process flow in the data acquisition unit 12211 at the time of prediction will be described in detail with reference to Fig. 9. First, the data acquisition unit 12211 acquires prediction target data to which affiliation information has been added from the prediction client machine 14 (S801).

次に、データ取得部12211は、取得したデータのフォーマット等をチェックする(S802)。ここでは、学習時に与えられたデータフォーマット(S302)と同一であるかどうかを確認する必要があるが、これ以外に追加のチェックが含まれてもよい。データ取得部12211は、チェックを行ったデータを個別前処理実行部12213に出力する(S803)。この出力されたデータを用いた格納データ取得部12212の処理を、図12を用いて後述する。 Next, the data acquisition unit 12211 checks the format of the acquired data (S802). Here, it is necessary to confirm whether the data format is the same as the data format given at the time of learning (S302), but additional checks may be included. The data acquisition unit 12211 outputs the checked data to the individual preprocessing execution unit 12213 (S803). The processing of the stored data acquisition unit 12212 using this output data will be described later with reference to FIG. 12.

次に、図10を用いて、予測時の格納データ取得部12212における処理フローを詳細に説明する。まず、格納データ取得部12212は、格納部1122から個別前処理内容、個別前処理用パラメータ、学習済み機械学習モデルを読みだす(S901)。なお、個別前処理内容、個別前処理用パラメータ、学習済み機械学習モデルのそれぞれは、個別前処理内容格納部11223、個別前処理用パラメータ格納部11222、学習済み機械学習モデル格納部11224に格納されている。 Next, the processing flow in the stored data acquisition unit 12212 at the time of prediction will be described in detail with reference to FIG. 10. First, the stored data acquisition unit 12212 reads out the individual preprocessing content, the individual preprocessing parameters, and the trained machine learning model from the storage unit 1122 (S901). The individual preprocessing content, the individual preprocessing parameters, and the trained machine learning model are stored in the individual preprocessing content storage unit 11223, the individual preprocessing parameter storage unit 11222, and the trained machine learning model storage unit 11224, respectively.

そして、格納データ取得部12212は、読みだした個別前処理内容及び個別前処理用パラメータを、次の処理（図11）を行う個別前処理実行部12213に出力する。また、格納データ取得部12212は、学習済み機械学習モデルを後述の処理（図12）を行う機械学習モデル予測部12215に出力する(S902)。 Then, the stored data acquisition unit 12212 outputs the read individual preprocessing contents and individual preprocessing parameters to the individual preprocessing execution unit 12213, which performs the next process (Figure 11). The stored data acquisition unit 12212 also outputs the trained machine learning model to the machine learning model prediction unit 12215, which performs the process described below (Figure 12) (S902).

次に、図11を用いて予測時における個別前処理実行部12213の処理フローを詳細に説明する。まず、個別前処理実行部12213は、データ取得部12211からデータを、格納データ取得部12212から個別前処理内容及び個別前処理用パラメータを取得する(S1001)。つまり、個別前処理実行部12213は、図10のS902で出力される個別前処理内容及び個別前処理用パラメータを取得する。 Next, the processing flow of the individual preprocessing execution unit 12213 during prediction will be described in detail with reference to FIG. 11. First, the individual preprocessing execution unit 12213 acquires data from the data acquisition unit 12211, and individual preprocessing contents and individual preprocessing parameters from the stored data acquisition unit 12212 (S1001). In other words, the individual preprocessing execution unit 12213 acquires the individual preprocessing contents and individual preprocessing parameters output in S902 of FIG. 10.

次に、個別前処理実行部12213は、以下の処理S1003を個別前処理の対象となる説明変数それぞれに対して処理が完了するまで行う(S1002)。個別前処理実行部12213は、未処理変数から一つの未処理変数を選択し、上述の個別前処理内容及び個別前処理用パラメータに従って個別前処理を実行する(S1003)。つまり、個別前処理実行部12213は、学習時における図6のS504やS506と同様の個別前処理を実行する。 Next, the individual preprocessing execution unit 12213 repeats the following process S1003 until it has been completed for each explanatory variable subject to individual preprocessing (S1002). The individual preprocessing execution unit 12213 selects one unprocessed variable and executes individual preprocessing according to the individual preprocessing contents and individual preprocessing parameters described above (S1003). In other words, the individual preprocessing execution unit 12213 executes the same individual preprocessing as in S504 and S506 of FIG. 6 during learning.
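
The loop of S1002 and S1003 can be sketched as below. The point it illustrates is that prediction reuses the method and the parameters stored during learning rather than recomputing statistics from the prediction target data. The dictionary layouts, method labels, and numeric values are hypothetical.

```python
def apply_stored_preprocessing(record, institution, methods, params):
    """Apply the methods chosen during learning to one prediction record."""
    corrected = dict(record)
    for variable, method in methods.items():
        p = params[institution][variable]  # parameters stored at learning time
        x = record[variable]
        if method == "standardize":
            corrected[variable] = (x - p["mean"]) / p["std"]
        elif method == "center":
            corrected[variable] = x - p["mean"]
        elif method == "normalize":
            corrected[variable] = (x - p["min"]) / (p["max"] - p["min"])
        elif method != "none":
            raise ValueError(f"unknown method: {method}")
    return corrected


# Hypothetical stored contents: one method per explanatory variable,
# and parameters per institution and variable.
methods = {"income": "standardize", "age": "center"}
params = {"A": {"income": {"mean": 4.0, "std": 2.0}, "age": {"mean": 40.0}}}

corrected = apply_stored_preprocessing(
    {"income": 6.0, "age": 35.0}, "A", methods, params
)
```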

次に、個別前処理実行部12213は、対象となる説明変数それぞれに対して処理S1003が終了したら、個別前処理済みのデータを次の全機関データ集計部12214に出力する(S1004)。 Next, when the individual preprocessing execution unit 12213 has completed processing S1003 for each of the target explanatory variables, it outputs the individually preprocessed data to the next all-institution data aggregation unit 12214 (S1004).

次に、予測時には、全機関データ集計部12214では個別前処理実行部12213から個別前処理済みデータを受け取り、図7に示した学習時における全機関データ集計部11214と同様の処理フローによってデータを結合する。これによって、学習時と同一の補正を予測対象データに実施することができる。全機関データ集計部12214は、結合されたデータを次の処理（図12）を行う機械学習モデル予測部12215に出力する。 Next, at the time of prediction, the all-institution data aggregation unit 12214 receives the individually preprocessed data from the individual preprocessing execution unit 12213 and combines the data using a processing flow similar to that of the all-institution data aggregation unit 11214 during learning shown in FIG. 7. This makes it possible to apply the same corrections to the prediction target data as during learning. The all-institution data aggregation unit 12214 outputs the combined data to the machine learning model prediction unit 12215, which performs the next process (FIG. 12).

次に、図12を用いて予測時の機械学習モデル予測部12215における処理フローを詳細に説明する。まず、機械学習モデル予測部12215は、全機関データ集計部12214から個別前処理済み集計データを、格納データ取得部12212から学習済み機械学習モデルをそれぞれ受け取る(S1101)。 Next, the processing flow in the machine learning model prediction unit 12215 at the time of prediction will be described in detail with reference to FIG. 12. First, the machine learning model prediction unit 12215 receives the individual preprocessed aggregated data from the all-institution data aggregation unit 12214 and the trained machine learning model from the stored data acquisition unit 12212 (S1101).

次に、機械学習モデル予測部12215は、学習済み機械学習モデルに集計データを入力し、予測値を算出する(S1102)。 Next, the machine learning model prediction unit 12215 inputs the aggregated data into the trained machine learning model and calculates a predicted value (S1102).

最後に、機械学習モデル予測部12215は、予測結果を次の出力部12216に送る(S1103)。 Finally, the machine learning model prediction unit 12215 sends the prediction result to the next output unit 12216 (S1103).

その後、出力部12216において予測時に行われた補正の結果及び機械学習モデルによる予測結果を、予測用クライアント部1421に出力する。出力の際のI/F、つまり表示装置への表示の例については図16を用いて後述する。 Then, the output unit 12216 outputs the result of the correction performed at the time of prediction and the prediction result by the machine learning model to the prediction client unit 1421. An example of the I/F at the time of output, that is, the display on the display device, will be described later with reference to FIG. 16.

以上で、予測時の各詳細処理フローの説明を終了し、次に、本実施例で用いられる各種データについて説明する。
（各種データ）
まず、図13に、実施例1における学習用データ及び予測対象データの一例を示す。テーブル2000は、データ例を表の形に表したものである。例として、年収2003、家族人数2004、年齢2005の3種の説明変数から何らかの二値の目的変数2006の値を予測するタスクを、機関A、機関Bの二つの機関名2002からデータを集めて実行するケースを考える。 This concludes the detailed description of each process flow during prediction. Next, various data used in this embodiment will be described.
(Various data)
First, FIG. 13 shows an example of the learning data and prediction target data in Embodiment 1. Table 2000 presents example data in tabular form. As an example, consider a case in which the task of predicting the value of some binary objective variable 2006 from three explanatory variables, namely annual income 2003, number of family members 2004, and age 2005, is carried out by collecting data from two institutions, institution A and institution B, identified by institution name 2002.

ID2001は、各データを識別する情報であり、例えば各データに付与された通し番号である。機関名2002で示される機関の数は2に限らず、一般に多数の機関からデータを収集して本実施例の内容を適用してよい。また、目的変数2006は二値変数に限らず、任意の分類、回帰などの教師あり学習、及びクラスタリングなどの教師無し学習のタスクでもよい。 ID 2001 is information for identifying each record, for example a serial number assigned to each record. The number of institutions indicated by institution name 2002 is not limited to two; in general, the contents of this embodiment may be applied to data collected from many institutions. In addition, the objective variable 2006 is not limited to a binary variable, and the task may be any supervised learning task such as classification or regression, or an unsupervised learning task such as clustering.

なお、ここで挙げた例においては、機関Aに属するデータは機関Bのデータに比べて年収が低く、家族人数が多く、年齢が高い傾向にあり、本発明が対象とする、機関の間にデータ分布の差がある状態を表している。またここでの例示では説明変数として数値変数のみを挙げている。但し、カテゴリ変数に対してもone-hot encodingやtarget encoding、count encodingなどのエンコード手法を用いて数値変数に変換したのちに本実施例の処理内容を適用することができる。 In the example given here, the data belonging to institution A tends to have lower annual income, larger family size, and older age than the data from institution B, which represents the state of difference in data distribution between institutions that is the subject of this invention. In addition, in the example given here, only numerical variables are given as explanatory variables. However, the processing contents of this embodiment can also be applied to categorical variables after converting them into numerical variables using encoding methods such as one-hot encoding, target encoding, and count encoding.
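
Two of the encoding methods named above, one-hot encoding and count encoding, can be sketched in plain Python as follows. The categorical column `occupation` and its values are hypothetical and do not appear in table 2000; once encoded, the resulting numerical columns could be passed through the same per-institution preprocessing as any other numerical variable.

```python
from collections import Counter


def one_hot(values):
    """Return (encoded rows, category order) for a list of categorical values."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def count_encode(values):
    """Replace each category with the number of times it occurs."""
    counts = Counter(values)
    return [counts[v] for v in values]


# Hypothetical categorical explanatory variable.
occupation = ["clerk", "engineer", "clerk"]

encoded, categories = one_hot(occupation)
counts = count_encode(occupation)
```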

次に、図14に、データ補正条件ファイル3100を示す。データ補正条件ファイル3100には、データ補正条件、個別前処理用パラメータ、個別前処理内容が含まれる。ここでは図13のテーブル2000で挙げた例を使って各ファイルの内容を例示する。データ補正条件ファイル3100は例えば指標3101、及び各個別前処理手法を探索するか否かのフラグ3102~3105からなる。そしてこれらは、学習時に学習用クライアント部1321からデータ補正条件取得部11212に渡され、データ補正条件格納部11221に格納される。 Next, FIG. 14 shows the data correction condition file 3100. The data correction condition file 3100 includes data correction conditions, individual preprocessing parameters, and individual preprocessing contents. Here, the contents of each file are illustrated using the example given in table 2000 of FIG. 13. The data correction condition file 3100 consists of, for example, an index 3101 and flags 3102 to 3105 indicating whether or not each individual preprocessing method is to be searched. These are passed from the learning client unit 1321 to the data correction condition acquisition unit 11212 during learning and stored in the data correction condition storage unit 11221.

指標3101は、上述した指標値を格納するもので、予測に対する有用性を定量化するものである。このため、システム提供者によって実装された指標のうちから選択すればよく、ここで挙げた相関係数に限定されない。また、指標3101は、図6中の処理S505で用いられ、S505ではこの指標値を最大化する個別前処理手法が各対象説明変数ごとに選択される。 The index 3101 stores the index value described above and quantifies usefulness for prediction. It may therefore be selected from among the indexes implemented by the system provider and is not limited to the correlation coefficient mentioned here. The index 3101 is used in process S505 in FIG. 6, where the individual preprocessing method that maximizes this index value is selected for each target explanatory variable.

次に、各個別前処理手法を探索するか否かのフラグ3102~3105は、システム提供者によって実装された個別前処理手法の一覧を表示すればよい。このため、このフラグはここに挙げたものに限定しない。 Next, the flags 3102 to 3105, which indicate whether or not each individual preprocessing method is to be searched, may simply list the individual preprocessing methods implemented by the system provider. These flags are therefore not limited to the ones shown here.

また、ここで挙げた個別標準化3103は各機関ごとにデータから平均値を引き、標準偏差で割る操作の有無を示すフラグである。個別中心化3104は、各機関ごとにデータから平均値を引く操作の有無を示すフラグである。個別正規化3105は、各機関ごとにデータの最小値と最大値を用いてデータ値を0から1の間に収める操作の有無を示すフラグである。各個別前処理手法を探索するか否かのフラグ3102~3105は、図6中の処理S504で用いられる。このため、このフラグは、探索範囲となっている個別前処理の中で上述の指標3101を最大化する個別前処理手法が各対象説明変数ごとに選択される。ここで示した例においては、注目する説明変数の目的変数との相関係数が最大となる個別前処理手法を、個別前処理なし、個別標準化、個別中心化の3通りの中から選択する設定の場合を示している。 The individual standardization 3103 shown here is a flag indicating whether or not to perform the operation of subtracting the mean from the data of each institution and dividing by the standard deviation. The individual centering 3104 is a flag indicating whether or not to perform the operation of subtracting the mean from the data of each institution. The individual normalization 3105 is a flag indicating whether or not to perform the operation of scaling the data values of each institution into the range 0 to 1 using that institution's minimum and maximum values. The flags 3102 to 3105 are used in process S504 in FIG. 6. Accordingly, for each target explanatory variable, the individual preprocessing method that maximizes the index 3101 is selected from among the methods that the flags place in the search range. The example shown here is a setting in which the method that maximizes the correlation coefficient between the explanatory variable of interest and the objective variable is selected from three options: no individual preprocessing, individual standardization, and individual centering.
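
One way to picture the example setting just described is as a small in-memory structure holding the index and the search flags. The file format of the data correction condition file 3100 is not given in the description, so the dictionary layout, key names, and the helper `candidate_methods` below are purely illustrative assumptions.

```python
# Hypothetical in-memory form of the data correction condition file 3100.
condition = {
    "index": "correlation",      # index 3101
    "methods": {
        "none": True,            # flag 3102: no individual preprocessing
        "standardize": True,     # flag 3103: individual standardization
        "center": True,          # flag 3104: individual centering
        "normalize": False,      # flag 3105: individual normalization
    },
}


def candidate_methods(condition):
    """Return the methods whose search flag is enabled, in file order."""
    return [name for name, enabled in condition["methods"].items() if enabled]


candidates = candidate_methods(condition)
```

With the flags set as above, the search in S504 would cover three candidates and skip individual normalization.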

次に、個別前処理用パラメータファイル3200は、例えば個別前処理手法3201、機関名3202、説明変数ごとの個別前処理用パラメータ3203~3205からなる。そして、個別前処理用パラメータファイル3200は、学習時に図6中、個別前処理実行部11213の処理S501で計算され、処理S502で個別前処理用パラメータ格納部11222に格納される。また、個別前処理用パラメータファイル3200は、予測時には図10中、格納データ取得部12212の処理S901で読み出され、図11中、個別前処理実行部12213の処理S1003において個別前処理に使用される。この例では、例えば個別標準化に必要な平均、標準偏差の値を各機関、各説明変数ごとに記録している。具体的には、機関Aの年収(百万円)の平均が3.7、標準偏差が0.8、機関Bの年収(百万円)の平均が6.0、標準偏差が1.9であることを示している。 Next, the individual preprocessing parameter file 3200 consists of, for example, an individual preprocessing method 3201, an institution name 3202, and individual preprocessing parameters 3203 to 3205 for each explanatory variable. The individual preprocessing parameter file 3200 is calculated in process S501 of the individual preprocessing execution unit 11213 in FIG. 6 during learning, and stored in the individual preprocessing parameter storage unit 11222 in process S502. During prediction, it is read out in process S901 of the stored data acquisition unit 12212 in FIG. 10 and used for individual preprocessing in process S1003 of the individual preprocessing execution unit 12213 in FIG. 11. In this example, the mean and standard deviation required for individual standardization are recorded for each institution and each explanatory variable. Specifically, the file shows that the annual income (million yen) of institution A has a mean of 3.7 and a standard deviation of 0.8, and that of institution B has a mean of 6.0 and a standard deviation of 1.9.

また、個別中心化は平均値を減ずるだけの操作であるため平均値のみが記録される。その他の個別前処理手法がデータ補正条件ファイル3100上で指定されていれば、必要に応じたパラメータが記載される。機関ごとに個別の平均値標準偏差などパラメータを用いて前処理を行うことで、各機関の特徴が均され、機械学習モデルの予測精度向上に寄与する。 In addition, since individual centering is an operation that simply subtracts the mean value, only the mean value is recorded. If other individual preprocessing methods are specified in the data correction condition file 3100, parameters will be recorded as necessary. By performing preprocessing using parameters such as the mean value and standard deviation that are individual to each institution, the characteristics of each institution are averaged out, which contributes to improving the predictive accuracy of the machine learning model.
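
The three individual preprocessing operations can be written compactly as follows. The symbols are introduced here for illustration only: $x$ is a value of one explanatory variable belonging to institution $k$, and $\mu_k$, $\sigma_k$, $m_k$, $M_k$ are that institution's mean, standard deviation, minimum, and maximum for the variable. The income value of 4.5 million yen in the worked lines is a hypothetical record; the means and standard deviations are the ones recorded in the parameter file example.

```latex
\begin{align*}
\text{individual standardization:}\quad & x' = \frac{x - \mu_k}{\sigma_k} \\
\text{individual centering:}\quad & x' = x - \mu_k \\
\text{individual normalization:}\quad & x' = \frac{x - m_k}{M_k - m_k} \\[6pt]
\text{institution A } (\mu_A = 3.7,\ \sigma_A = 0.8):\quad & x' = \frac{4.5 - 3.7}{0.8} = 1.0 \\
\text{institution B } (\mu_B = 6.0,\ \sigma_B = 1.9):\quad & x' = \frac{4.5 - 6.0}{1.9} \approx -0.79
\end{align*}
```

The same raw income therefore maps to an above-average value for institution A and a below-average value for institution B, which is the sense in which the characteristics of each institution are evened out.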

個別前処理内容ファイル3300は、例えば個別前処理の対象となる説明変数3301、決定された最適個別前処理3302、記録されるパラメータの定義3303からなる。それぞれの最適個別前処理3302は、学習時に図6中、個別前処理実行部11213の処理S505で決定される。そして、最適個別前処理3302は、個別前処理内容格納部11223に格納される。また、最適個別前処理3302は、予測時には図10中、格納データ取得部12212の処理S901で読み出され、図11中、個別前処理実行部12213の処理S1003において個別前処理に使用される。記録されるパラメータの定義3303は各最適個別前処理3302に対しあらかじめ定まった、必要なパラメータの意味、順序を示し、個別前処理用パラメータファイル3200に記載された数値の定義を示す。以上で、本実施例で用いる各データについての説明を終了する。
（I/F(表示画面)）
次に、本実施例での各種I/F、つまり、表示画面について説明する。 The individual preprocessing content file 3300 consists of, for example, explanatory variables 3301 subject to individual preprocessing, the determined optimal individual preprocessing 3302, and a definition 3303 of the parameters to be recorded. Each optimal individual preprocessing 3302 is determined in process S505 of the individual preprocessing execution unit 11213 in FIG. 6 during learning, and is then stored in the individual preprocessing content storage unit 11223. During prediction, the optimal individual preprocessing 3302 is read out in process S901 of the stored data acquisition unit 12212 in FIG. 10 and used for individual preprocessing in process S1003 of the individual preprocessing execution unit 12213 in FIG. 11. The definition 3303 of the parameters to be recorded gives the predetermined meaning and order of the parameters required by each optimal individual preprocessing 3302, and thus defines the numerical values written in the individual preprocessing parameter file 3200. This concludes the explanation of the data used in this embodiment.
(I/F (display screen))
Next, various I/Fs, that is, display screens in this embodiment will be described.

まず、図15に学習時に学習用クライアント部1321により学習用クライアントマシン13の表示装置に表示するI/Fの一例を示す。なお、図1には、学習用クライアントマシン13の表示装置を図示しない。但し、学習用クライアントマシン13は、表示装置を有するか、接続している。このような表示装置を、学習用クライアントマシン13の表示装置とする。 First, FIG. 15 shows an example of an I/F displayed on the display device of the learning client machine 13 by the learning client unit 1321 during learning. Note that FIG. 1 does not show the display device of the learning client machine 13. However, the learning client machine 13 has a display device or is connected to it. Such a display device is referred to as the display device of the learning client machine 13.

In FIG. 15, the input I/F 4100 is displayed on the display device when the user of the learning client machine 13 operates it, and accepts input to the learning data file input field 4110 and the data correction condition file input field 4120. When the user then presses the learning start button 4130, the series of learning-time processes starts.

The learning data file input field 4110 consists of a display field 4111 for the name of the selected learning data file and a button 4112 for searching for that file in, for example, the memory 1320 of the learning client machine 13 or the external storage device 1330. The data correction condition file input field 4120 consists of a display field 4121 for the name of the selected data correction condition file and a button 4122 for searching for that file in, for example, the memory 1320 or the external storage device 1330. Although this embodiment is described using on-screen input and output as an example, it is not limited to this. For example, the same information may be supplied through a programmatic API.

Also in FIG. 15, the output I/F 4200 is displayed on the display device when the user of the learning client machine 13 operates it. The output I/F 4200 includes the correction content 4205 adopted for each of the explanatory variables 4201 to 4204. It further includes a list 4206 of the parameters used in the individual preprocessing, a histogram 4207 of the all-institution combined data before correction, a histogram 4208 of the all-institution combined data after correction, a program end button 4209, and a retry button 4210. Here, the correction content 4205 refers to the individual preprocessing method.

The adopted individual preprocessing method 4205 and the list 4206 of parameters used in the individual preprocessing simply display what was described with reference to FIG. 14, so their description is omitted here.

The histograms 4207 and 4208 of the all-institution combined data before and after correction let the user judge, for each variable, whether the processing of this embodiment has properly corrected the differences in data trends that were the problem to be solved. If satisfied with the displayed result, the user can press the program end button 4209 to finish. If not, the user can press the retry button 4210 to return to the input I/F 4100 and proceed by trial and error, for example by changing the data correction condition file in the data correction condition file input field 4120.
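The comparison that the histograms 4207 and 4208 support can be sketched as follows. This is a minimal illustration only: it assumes that the correction is a per-institution standardization and uses made-up values for two institutions, and the function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def standardize(values):
    """Assumed correction: shift and scale one institution's data on its own."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Toy data: the same variable recorded on different scales by two institutions.
institution_a = [10.0, 12.0, 14.0, 16.0]
institution_b = [100.0, 120.0, 140.0, 160.0]

# Histogram of the combined data before correction (two separate clusters).
before = histogram(institution_a + institution_b, bin_width=50.0)

# Histogram of the combined data after correcting each institution separately.
after = histogram(
    standardize(institution_a) + standardize(institution_b), bin_width=1.0
)
```

Before correction the combined data splits into one cluster per institution. After correction the two institutions fall into the same bins, which is the kind of change a user would look for on the output screen.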

Next, FIG. 16 shows an example of the I/F that the prediction client unit 1421 displays on the display device of the prediction client machine 14 during prediction. Although FIG. 1 does not show it, the prediction client machine 14 either has a display device or is connected to one. That device is referred to here as the display device of the prediction client machine 14.

In FIG. 16, the input I/F 5100 is displayed on the display device when the user of the prediction client machine 14 operates it, and accepts input to the prediction target data file input field 5110. When the user then presses the start button 5120, the series of prediction-time processes starts. Although this embodiment is described using on-screen input and output as an example, it is not limited to this. For example, the same information may be supplied through a programmatic API.

The prediction target data file input field 5110 consists of a display field 5111 for the name of the selected prediction target data file and a button 5112 for searching for that file in, for example, the memory 1420 of the prediction client machine 14 or the external storage device 1430.

Also in FIG. 16, the output I/F 5200 is almost the same as the learning-time output I/F 4200 in FIG. 15, so the overlapping parts 5201 to 5210 are not described again. It differs from the output I/F 4200 in that it has a button 5211 for downloading a file containing the predicted values. An I/F for displaying or downloading the prediction accuracy, prediction error, and the like may also be provided as needed.

When information from learning, such as the learning data or the correction method, should not be disclosed, some or all of the correction content 5205, the correction parameters 5206, the pre-correction prediction data 5207, and the post-correction prediction data 5208 need not be displayed. The output unit 12216 of the prediction server 12 controls this for the prediction client machine 14.
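One way to picture this suppression is a filter applied to the result before it is sent to the client. The sketch below is an assumption-laden illustration: the field names are hypothetical labels for items 5205 to 5208, and the function is not the actual interface of the output unit 12216.

```python
# Hypothetical labels for the items that may be withheld from the client.
LEARNING_TIME_FIELDS = frozenset({
    "correction_content",       # 5205
    "correction_parameters",    # 5206
    "pre_correction_data",      # 5207
    "post_correction_data",     # 5208
})


def build_display_payload(result, hidden=LEARNING_TIME_FIELDS):
    """Return the result without the hidden fields; only 5205-5208 may be hidden."""
    unknown = set(hidden) - LEARNING_TIME_FIELDS
    if unknown:
        raise ValueError(f"cannot hide non-suppressible fields: {sorted(unknown)}")
    return {key: value for key, value in result.items() if key not in hidden}


# Toy result with made-up values.
result = {
    "correction_content": "standardize",
    "correction_parameters": [400.0, 50.0],
    "pre_correction_data": [450.0],
    "post_correction_data": [1.0],
    "predicted_values": [0.75],
}

full = build_display_payload(result, hidden=frozenset())
partial = build_display_payload(result, hidden={"correction_parameters"})
minimal = build_display_payload(result)
```

Filtering on the server side, rather than hiding fields in the client, keeps the withheld information from leaving the server at all.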

The above description covers the case in which a prediction model is built from data collected from multiple institutions, but the invention is not limited to that case. The method can also be applied within a single institution when the data can be regarded as coming from different data sources, such as sensor data collected from multiple production machines or customer purchase data for multiple products. The same applies to the other examples described below.

The form shown in Example 1 can correct multiple prediction target data and output their predicted values in one batch, which suits applications that must process a large amount of prediction target data in a short time. However, it does not yield detailed insight into the prediction result for an individual record. For example, it cannot show how the output changes when an input explanatory variable is changed. This can make applications such as the following difficult: in marketing, investigating which advertisements to show in order to lead to a product purchase; and in predicting equipment failure from sensor data, investigating which parts to inspect in order to identify the cause of a failure.

Therefore, in Example 2, a single record can be input at prediction time, and outputs such as the predicted value can be obtained interactively while the values of that record are changed. Although a single record is used in this example, a predetermined number of records may be used instead.

FIG. 17 shows an example of the I/F that the prediction client unit 1421 displays on the display device of the prediction client machine 14 during prediction in Example 2.

In FIG. 17, the input I/F 6100 includes a prediction target data input field 6110, which has fields for entering the name 6111 of the institution to which the record belongs and the explanatory variable values 6112 to 6114. When the user of the prediction client machine 14 presses the prediction start button 6115, the display device of the prediction client machine 14 displays the output I/F 6200 with the correction results, prediction results, and other items described below. Alternatively, the prediction start button 6115 may be omitted and the output I/F 6200 displayed in real time as soon as fields 6111 to 6114 are filled in.

The institution name 6111 is, for example, selected from the list of institution names entered during learning, and the explanatory variable values 6112 to 6114 can be changed easily with suitable buttons.
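The interactive use described for Example 2 amounts to re-running correction and prediction for one record while one value is varied. The sketch below illustrates this under stated assumptions: the stored parameters, the standardization used as the correction, and the fixed linear score standing in for the trained model are all hypothetical.

```python
# Hypothetical stored correction parameters: institution -> variable -> (mean, std).
STORED_PARAMETERS = {
    "institution_a": {"age": (40.0, 10.0)},
    "institution_b": {"age": (50.0, 5.0)},
}


def correct(record):
    """Apply the correction stored for the record's institution."""
    params = STORED_PARAMETERS[record["institution"]]
    return {name: (record[name] - m) / s for name, (m, s) in params.items()}


def predict(record):
    """Toy stand-in for the trained model: a fixed linear score on the corrected value."""
    corrected = correct(record)
    return 0.5 + 0.25 * corrected["age"]


def what_if(record, variable, candidates):
    """Return (value, prediction) pairs as one explanatory variable is varied."""
    return [(v, predict({**record, variable: v})) for v in candidates]


base = {"institution": "institution_a", "age": 40.0}
sweep = what_if(base, "age", [30.0, 40.0, 50.0])
```

Because the correction depends on the institution, the same raw value can lead to different predictions for different institutions, which is why the institution name is part of the input.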

The output I/F 6200 in FIG. 17 is almost the same as the learning-time output I/F 4200 in FIG. 15, so the overlapping parts are not described again. It differs from the output I/F 4200 in that it can display the input value 6205, the correction value 6207 (the value after correction), and the output predicted value 6211. As with the output I/F 5200 in FIG. 16, when information from learning, such as the learning data or the correction method, should not be disclosed to the prediction client machine 14, the output unit 12216 can suppress its display. That is, the display device of the prediction client machine 14 need not display some or all of the correction content 6206, the correction value 6207, the correction parameters 6208, the pre-correction prediction data 6209, and the post-correction prediction data 6210.

In Example 1, learning and prediction are each performed only once, so a machine learning model can be built and used for prediction quickly. On the other hand, the performance of the prediction model is unknown until the predicted values are finally obtained. If those values show that, for example, the prediction accuracy falls short of the desired level, a feedback loop such as revising the data correction condition file must be run. In building a machine learning model, this is generally handled by splitting the learning dataset to set aside a validation dataset and adjusting the conditions while checking the accuracy on the validation data.

Therefore, in Example 3, the learning server 11 takes multiple data correction condition files as input during learning and, during prediction, takes validation data as input in place of the prediction target data. The series of steps described in Example 1 is repeated once per input data correction condition file and the prediction accuracy is calculated each time, so that each data correction condition file is assigned a prediction accuracy on the validation data. Selecting the data correction condition file with the highest validation accuracy yields a model with high prediction accuracy.
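The selection loop of Example 3 can be sketched as below. This is a toy illustration, not the actual pipeline: the two conditions, the per-institution centering used as the correction, and the threshold classifier standing in for the machine learning model are all assumptions, and the data is made up.

```python
from statistics import mean


def fit_offsets(train, method):
    """Per-institution means when the condition asks for centering, else nothing."""
    if method != "center":
        return {}
    institutions = {inst for inst, _, _ in train}
    return {i: mean(x for inst, x, _ in train if inst == i) for i in institutions}


def accuracy(train, valid, method):
    """Toy pipeline: correct, fit a threshold on training data, score on validation."""
    offsets = fit_offsets(train, method)

    def corrected(inst, x):
        return x - offsets.get(inst, 0.0)

    threshold = mean(corrected(i, x) for i, x, _ in train)
    hits = sum((corrected(i, x) > threshold) == bool(y) for i, x, y in valid)
    return hits / len(valid)


def select_condition(conditions, train, valid):
    """Assign a validation accuracy to each condition and return the best one."""
    scores = {
        name: accuracy(train, valid, cond["method"])
        for name, cond in conditions.items()
    }
    return max(scores, key=scores.get), scores


# Rows are (institution, explanatory variable, objective variable).
train = [("A", x, int(x > 2)) for x in (1, 2, 3, 4)]
train += [("B", x, int(x > 102)) for x in (101, 102, 103, 104)]
valid = [("A", 1.5, 0), ("A", 3.5, 1), ("B", 101.5, 0), ("B", 103.5, 1)]

conditions = {
    "condition_a": {"method": "none"},
    "condition_b": {"method": "center"},
}
best, scores = select_condition(conditions, train, valid)
```

In this toy case the uncorrected condition cannot separate the classes within either institution, so the centering condition is selected.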

In addition, as in cross-validation, the validation file can be split off in multiple patterns and the data correction condition file with the highest average accuracy across those patterns selected, which yields a more accurate and more robust model.
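Averaging over several splits can be sketched as follows. The interleaved k-fold split, the condition names, and the toy `evaluate` function (per-institution centering followed by a threshold classifier) are assumptions made for illustration, and the data is made up.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Yield (train, valid) index lists; fold i validates on rows i, i+k, i+2k, ..."""
    for i in range(k):
        valid = [j for j in range(n) if j % k == i]
        train = [j for j in range(n) if j % k != i]
        yield train, valid


def average_accuracy(rows, condition, evaluate, k):
    """Mean validation accuracy of one condition over k splits."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(rows), k):
        train = [rows[j] for j in train_idx]
        valid = [rows[j] for j in valid_idx]
        scores.append(evaluate(condition, train, valid))
    return mean(scores)


def pick_condition(rows, conditions, evaluate, k):
    """Return the condition with the highest average accuracy, plus all averages."""
    averages = {c: average_accuracy(rows, c, evaluate, k) for c in conditions}
    return max(averages, key=averages.get), averages


def evaluate(condition, train, valid):
    """Toy pipeline: optional per-institution centering, then a threshold classifier."""
    offsets = {}
    if condition == "center":
        for inst in {r[0] for r in train}:
            offsets[inst] = mean(r[1] for r in train if r[0] == inst)

    def fix(row):
        return row[1] - offsets.get(row[0], 0.0)

    threshold = mean(fix(r) for r in train)
    hits = sum((fix(r) > threshold) == bool(r[2]) for r in valid)
    return hits / len(valid)


# Rows are (institution, explanatory variable, objective variable).
rows = [("A", 1, 0), ("A", 2, 0), ("A", 3, 1), ("A", 4, 1),
        ("B", 101, 0), ("B", 102, 0), ("B", 103, 1), ("B", 104, 1)]
best, averages = pick_condition(rows, ["none", "center"], evaluate, k=2)
```

A single split can favor a condition by chance. Averaging over folds reduces that dependence on one particular split, which is the robustness benefit the text refers to.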

According to the above examples, a highly accurate integrated prediction model can be created with few man-hours. Building a machine learning model from diverse data also enables more accurate analysis. For example, the model can cope with social change, and a volume of data that no single institution could collect becomes available, which leads to more accurate data analysis.

1: Computer system
11: Learning server
12: Prediction server
13: Learning client machine
14: Prediction client machine
11211: Data acquisition unit
11212: Data correction condition acquisition unit
11213: Individual preprocessing execution unit
11214: All-institution data aggregation unit
11215: Machine learning model learning unit
11216: Output unit
11221: Data correction condition storage unit
11222: Individual preprocessing parameter storage unit
11223: Individual preprocessing content storage unit
11224: Trained machine learning model storage unit
12211: Data acquisition unit
12212: Stored data acquisition unit
12213: Individual preprocessing execution unit
12214: All-institution data aggregation unit
12215: Machine learning model prediction unit
12216: Output unit
1321: Learning client unit
1421: Prediction client unit
2000: Table
3100: Data correction condition file
3200: Individual preprocessing parameter file
3300: Individual preprocessing content file
4100: Learning-time input I/F
4200: Learning-time output I/F
5100: Prediction-time input I/F
5200: Prediction-time output I/F
6100: Prediction-time input I/F in Example 2
6200: Prediction-time output I/F in Example 2

Claims

1. A learning model construction system for constructing a learning model for data analysis, comprising:
a memory unit that stores multi-institution data collected from a plurality of institutions whose data distributions differ;
a storage unit that stores preprocessing conditions for data preprocessing of the multi-institution data; and
an individual preprocessing execution unit that:
calculates, using the preprocessing conditions, parameters required for data preprocessing of each of the multi-institution data;
executes, for each explanatory variable in the data analysis and for each of the multi-institution data, each of first data preprocessings using respective ones of a plurality of parameters;
calculates, for each data preprocessing, an index value indicating usefulness for the prediction that is the result of the data analysis, by combining the data on which that data preprocessing has been executed;
identifies, for each explanatory variable and based on the index value, a predetermined data preprocessing from among the data preprocessings for each of the multi-institution data; and
executes, on each of the multi-institution data, a second data preprocessing identified for each explanatory variable.

2. The learning model construction system according to claim 1,
wherein the individual preprocessing execution unit identifies, as the predetermined data preprocessing, the data preprocessing for which the index value is largest.

3. The learning model construction system according to claim 2,
wherein the individual preprocessing execution unit identifies the predetermined data preprocessing using, as the index value, a correlation coefficient between the result of the data analysis and an objective variable.

4. The learning model construction system according to claim 1,
wherein the individual preprocessing execution unit reuses the result of the first data preprocessing as the second data preprocessing.

5. The learning model construction system according to claim 1, further comprising:
a data aggregation unit that combines the multi-institution data on which the second data preprocessing has been executed; and
a machine learning model learning unit that constructs a learning model using the combined multi-institution data.

6. The learning model construction system according to claim 5, further comprising a prediction device that has:
a prediction-time individual preprocessing execution unit that executes, on each of multi-institution data to be predicted, a third data preprocessing identified for each explanatory variable; and
a machine learning model prediction unit that inputs the multi-institution data on which the third data preprocessing has been executed into the learning model and performs prediction.

7. A learning model construction method using a learning model construction system that constructs a learning model for data analysis, the method comprising:
storing, in a memory unit, multi-institution data whose data distributions differ;
storing, in a storage unit, preprocessing conditions for data preprocessing of the multi-institution data;
calculating, using the preprocessing conditions, parameters required for data preprocessing of each of the multi-institution data;
executing, for each explanatory variable in the data analysis and for each of the multi-institution data, each of first data preprocessings using respective ones of a plurality of parameters;
calculating, for each data preprocessing, an index value indicating usefulness for the prediction that is the result of the data analysis, by combining the data on which that data preprocessing has been executed;
identifying, for each explanatory variable and based on the index value, a predetermined data preprocessing from among the data preprocessings for each of the multi-institution data; and
executing, on each of the multi-institution data, a second data preprocessing identified for each explanatory variable.

8. The learning model construction method according to claim 7,
wherein the data preprocessing for which the index value is largest is identified as the predetermined data preprocessing.

9. The learning model construction method according to claim 8,
wherein the predetermined data preprocessing is identified using, as the index value, a correlation coefficient between the result of the data analysis and an objective variable.

10. The learning model construction method according to claim 7,
wherein the result of the first data preprocessing is reused as the second data preprocessing.

11. The learning model construction method according to claim 7, further comprising:
combining the multi-institution data on which the second data preprocessing has been executed; and
constructing a learning model using the combined multi-institution data.

12. The learning model construction method according to claim 11,
wherein a prediction device included in the learning model construction system:
executes, on each of multi-institution data to be predicted, a third data preprocessing identified for each explanatory variable; and
inputs the multi-institution data on which the third data preprocessing has been executed into the learning model and performs prediction.