JP2024030579A

JP2024030579A - Information processing method, information processing system, and information processing program

Info

Publication number: JP2024030579A
Application number: JP2022133545A
Authority: JP
Inventors: 理敏関根; Takatoshi Sekine
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2022-08-24
Filing date: 2022-08-24
Publication date: 2024-03-07
Also published as: WO2024042736A1

Abstract

PROBLEM TO BE SOLVED: To increase interpretability of attribute values and attributes of training data and test data for a learning model.

SOLUTION: For each attribute, a regression model is set in which the latent variables that an MFCVAE outputs in response to input data are the explanatory variables and the attribute value is the objective variable; the input data includes reference data in which attribute values are assigned to a plurality of attributes of the data. From the latent variables and the attribute values, the predicted value of the attribute value that minimizes the prediction error, and the regression coefficients of the regression model, are calculated for each attribute. Based on the per-attribute predicted values and regression coefficients, the function value of a loss function is calculated, the loss function being the MFCVAE loss function with added terms based on per-attribute indices that take smaller values the better the latent variables and attribute values fit the regression models. The model parameters of the MFCVAE are updated by error backpropagation based on the function value. Model training of the MFCVAE is executed in this way.

SELECTED DRAWING: Figure 9

Description

The present invention relates to an information processing method, an information processing system, and an information processing program.

The quality of an AI (Artificial Intelligence) model depends on the quality of the data it uses. To guarantee the quality of an AI model, it is useful to evaluate information about the attributes of the training data used to build the model and of the test data used for inference. For this purpose, variational autoencoder technology is used to extract latent variables (feature quantities) from the encoder of a learning model, for example to clarify the content of the attribute information of data or to extract data whose attribute information is similar.

For example, in Patent Document 1, three images corresponding to the same semantic feature are extracted from the training data, and the parameters of a variational autoencoder are updated so as to minimize, for each of the three images, the loss function of the latent variable corresponding to the semantic feature. This improves the ability to discriminate between different images that have the same semantic feature.

As another example, Non-Patent Document 1 increases the independence of the latent variables so that the information each latent variable gives about the input is unique, which improves the interpretability of the content of the attribute that corresponds to a change in a latent variable and of the change in its magnitude. As a result, for handwritten character data, for example, it can be seen that the angle of a character changes continuously from a leftward slant to a rightward slant in response to a change in a certain latent variable.

As another example, Non-Patent Document 2 expresses a latent variable as a linear combination of orthogonal bases and associates the basis coefficients obtained by learning with changes in the attributes of the data, which improves the interpretability of the content of the attribute that corresponds to a change in a basis coefficient and of the change in its magnitude. As a result, for face image data, for example, it can be seen that hair color changes continuously from blond to black in response to a change in a certain basis coefficient.

Patent Document 1: JP 2019-75108 A

Non-Patent Document 1: Shuyang Gao, Rob Brekelmans, Greg Ver Steeg, Aram Galstyan, "Auto-Encoding Total Correlation Explanation," Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 1157-1166, [online], Proceedings of Machine Learning Research (PMLR), 2019, [retrieved August 1, 2022], Internet <URL: https://arxiv.org/abs/1802.05822>

Non-Patent Document 2: Jin-Young Kim, Sung-Bae Cho, "BasisVAE: Orthogonal Latent Space for Deep Disentangled Representation," [online], International Conference on Learning Representations (ICLR), 2020, [retrieved August 1, 2022], Internet <URL: https://arxiv.org/abs/1802.05822>

However, in the conventional technologies described above, the correspondence between a latent variable and its attributes and attribute values depends on the user's interpretation and can only be evaluated qualitatively. The interpretability of the attributes and attribute values of the data therefore remains low.

One aspect of the disclosure of the present application aims to improve the interpretability of the attributes and attribute values that correspond to the latent variables of training data and test data.

One aspect of the disclosure of the present application is an information processing method executed by an information processing system having a processing unit and a storage unit, in which the processing unit executes: a first step of inputting input data, which includes reference data in which attribute values are assigned to a plurality of attributes of the data, to an MFCVAE (Multi-Facet Clustering Variational Auto-Encoder) that outputs latent variables for each of the plurality of attributes of the data; a second step of setting, for each attribute, a regression model in which the latent variables output from the MFCVAE in response to the input data are the explanatory variables and the attribute value is the objective variable; a third step of calculating, for each attribute, from the latent variables and the attribute values, the predicted value of the attribute value that minimizes the prediction error with respect to the attribute value, together with the regression coefficients of the regression model; a fourth step of calculating, for each attribute, based on the predicted values and regression coefficients calculated in the third step, an index that takes a smaller value the better the latent variables and the attribute values fit the regression model; a fifth step of calculating the function value of a loss function obtained by adding an additional term based on the index for each attribute to the loss function of the MFCVAE, which has a reconstruction error term representing the error of data reconstruction by the MFCVAE and a regularization term that constrains the distribution of the latent variables; and a sixth step of updating the model parameters of the MFCVAE by error backpropagation based on the function value calculated in the fifth step; and in which model learning of the MFCVAE is executed by repeating the first through sixth steps in this order until the prediction error or the number of epochs satisfies a predetermined condition.

According to one aspect of the disclosure of the present application, the interpretability of the attributes and attribute values of the training data and test data of a learning model can be improved. Problems, configurations, and effects other than those described above will become clear from the following description of the embodiments.

FIG. 1 is a diagram for explaining the problems of the conventional technology (MFCVAE).
FIG. 2 is a diagram showing reference data and evaluation data (in the case of character data).
FIG. 3 is a diagram showing reference data and evaluation data (in the case of general data).
FIG. 4 is a diagram for explaining the operation of the information processing system according to Embodiment 1 during model learning.
FIG. 5 is a diagram for explaining the operation of the information processing system according to Embodiment 1 when assigning attribute values to evaluation data.
FIG. 6 is a diagram for explaining the operation of the information processing system according to Embodiment 1 when generating data with specified attribute values.
FIG. 7 is a diagram for explaining the transition of the coefficient of determination during model learning of the information processing system according to Embodiment 1.
FIG. 8 is a block diagram showing the configuration of the information processing system according to Embodiment 1.
FIG. 9 is a flowchart showing the feature quantity extraction processing according to Embodiment 1.
FIG. 10 is a flowchart showing the attribute value assignment processing according to Embodiment 1.
FIG. 11 is a flowchart showing the data generation processing according to Embodiment 1.
FIG. 12 is a flowchart showing the data quality evaluation processing according to Embodiment 1.
FIG. 13 is a diagram showing output example 1 of attributes and attribute values (attributes and attribute values for data) according to Embodiment 1.
FIG. 14 is a diagram showing output example 2 of data, attributes, and attribute values (the number of data items for each attribute and attribute value) according to Embodiment 1.
FIG. 15 is a block diagram showing the configuration of the information processing system according to Embodiment 2.
FIG. 16 is a diagram showing the hardware configuration of a computer.

Hereinafter, embodiments according to the disclosure of the present application will be described with reference to the drawings. The embodiments, including the drawings, are examples for explaining the present application. In the embodiments, omissions and simplifications are made as appropriate for clarity of explanation. Unless otherwise specified, each constituent element of the embodiments may be singular or plural. A form that combines one embodiment with another is also included in the embodiments according to the present application.

Identical or similar constituent elements are given the same reference signs, and in later embodiments their description may be omitted or limited to the differences from what has already been described. When there are a plurality of identical or similar constituent elements, they may be described with different subscripts attached to the same reference sign. When there is no need to distinguish between these constituent elements, the subscripts may be omitted.

In the following embodiments, various kinds of information are described in table format, but the information may be in a data format other than a table. Names such as "XX information", "XX table", "XX list", and "XX queue" are interchangeable. For example, an "XX table" may be called an "XX list". Likewise, expressions such as "identification information", "identifier", "name", "ID", and "number" are used when describing identification information, and these are interchangeable.

(Problems with the conventional technology)
Prior to describing the embodiments, the problems of the conventional technology on which the embodiments are based (MFCVAE: Multi-Facet Clustering Variational Autoencoders) will be explained. FIG. 1 is a diagram for explaining the problems of the conventional technology. MFCVAE is an extended variational autoencoder (VAE: Variational Auto-Encoder) that can output latent variables for a plurality of facets. A variational autoencoder is a generative model that uses a neural network and assumes a probability distribution over the space of latent variables. A facet in MFCVAE is a type of latent variable (vector) output by the MFCVAE; for character data, examples are "character type" and "character shape (thickness, angle, etc.)".

The variational autoencoder is disclosed in Document 1: Diederik P Kingma, Max Welling, "Auto-Encoding Variational Bayes," May 2014, [retrieved August 1, 2022], Internet <URL: https://arxiv.org/abs/1312.6114>. MFCVAE is disclosed in Document 2: Fabian Falck et al., "Multi-Facet Clustering Variational Autoencoders," Oct. 2021, [retrieved August 1, 2022], Internet <URL: https://arxiv.org/abs/2106.05241>.

In the following, a feature is information that characterizes data (an attribute, an attribute value, a latent variable, and so on). A feature quantity is the value of a feature that can be expressed quantitatively. An attribute is a property that characterizes data (for character data, "thickness", "slant", "noise amount", "degree of character degradation", and so on). An attribute value is a value that indicates the degree of an attribute (for character data, "1 mm" for the attribute "thickness", "10 degrees" for the attribute "slant", "10%" for the attribute "noise amount", "level 2" for the attribute "degree of character degradation", and so on). An attribute value may be continuous or discrete. A latent variable is a feature quantity output from the encoder in variational autoencoder-related technology. Variational autoencoder-related technology refers to variational autoencoder technology in general that has a variational Bayes algorithm, including VAE and MFCVAE.

In MFCVAE, a change in one latent variable changes the attribute values of a plurality of attributes. It has therefore been difficult to associate a change in a given latent variable with the attribute that changes. With reference to FIG. 1, attributes and attribute values are explained using handwritten characters as an example.

As shown in FIG. 1, in a coordinate system with latent variable 1 on the horizontal axis and latent variable 2 on the vertical axis, group 101 is the group of latent variables (latent variable 1 and latent variable 2) corresponding to characters whose attribute "thickness" is thin. Group 102 is the group of latent variables corresponding to characters whose "thickness" is medium. Group 103 is the group of latent variables corresponding to characters whose "thickness" is thick. Group 104 is the group of latent variables corresponding to characters whose attribute "noise amount" is small. Group 105 is the group of latent variables corresponding to characters whose "noise amount" is medium. Group 106 is the group of latent variables corresponding to characters whose "noise amount" is large. Here, "thick", "medium", and "thin" for the attribute "thickness", and "large", "medium", and "small" for the attribute "noise amount", are merely illustrative; they are examples of quantitative expressions or of labels assigned to them.

In the example shown in FIG. 1, the attribute values do not all increase or decrease uniformly in response to a change in a latent variable. It is therefore difficult to associate a change in a given latent variable with a change in an attribute value. For example, in FIG. 1, as the value of latent variable 1 increases in two stages from group 101, where it is small, to group 103 and then to group 102, the attribute value of the attribute "thickness" changes from "thin" to "thick" to "medium". The value of latent variable 1 increases uniformly, but the attribute value does not change uniformly. The same holds for the change in the attribute value of the attribute "noise amount" with respect to latent variable 1. As a result, the change in attribute value in response to a change in that latent variable is hard to interpret.

In contrast, as the value of latent variable 2 increases in two stages from group 103, where it is small, to group 102 and then to group 101, the attribute value of the attribute "thickness" changes from "thick" to "medium" to "thin"; that is, the attribute value decreases uniformly as the value of latent variable 2 increases uniformly. The same holds for the change in the attribute value of the attribute "noise amount" with respect to latent variable 2. Because the attribute value decreases uniformly as the latent variable changes, the change in attribute value in response to a change in that latent variable is easy to interpret.
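The contrast between latent variable 1 and latent variable 2 can be illustrated with a small monotonicity check. This is only a sketch: the group positions and the numeric coding of "thickness" (thin = 0, medium = 1, thick = 2) are hypothetical values chosen to reproduce the group ordering described for FIG. 1, and `is_monotonic` is an illustrative helper, not part of the disclosed method.

```python
import numpy as np

def is_monotonic(latent_values, attribute_values):
    """Return True if the attribute only rises or only falls as the latent value increases."""
    order = np.argsort(latent_values)  # visit the groups in order of increasing latent value
    steps = np.diff(np.asarray(attribute_values, dtype=float)[order])
    return bool(np.all(steps >= 0) or np.all(steps <= 0))

# Hypothetical coding of the attribute "thickness": thin=0, medium=1, thick=2.
thickness = [0, 1, 2]        # groups 101, 102, 103
# Hypothetical group positions matching the ordering described for FIG. 1.
latent1 = [0.0, 2.0, 1.0]    # along latent variable 1 the order is 101, 103, 102
latent2 = [2.0, 1.0, 0.0]    # along latent variable 2 the order is 103, 102, 101

print(is_monotonic(latent1, thickness))  # False: thin, thick, medium
print(is_monotonic(latent2, thickness))  # True: thick, medium, thin
```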

(Reference data and evaluation data)
First, the reference data and evaluation data for character data will be explained. FIG. 2 is a diagram showing reference data and evaluation data (in the case of character data). In FIG. 2, each row is called a data item. For each attribute, such as "character type", "thickness", and "slant", some data items store an attribute value and others do not. "Character type" is the image data of the character in question.

For a given attribute, a data item that stores an attribute value is reference data for that attribute, and a data item that does not store an attribute value is evaluation data for that attribute. Reference data is data whose attribute value is known and which is used to obtain the attribute values of the evaluation data. Evaluation data is data for which the attribute value of an attribute is unknown and is to be obtained and assigned.

In the case of character data, it is generally easy to assign an attribute value for the character type, such as "A" or "B", to handwritten character data, but it is not easy to assign attribute values for attributes such as "thickness" and "slant". Therefore, the variational autoencoder model is trained using printed type, such as a Gothic typeface, for which the attribute values of attributes such as "thickness" and "slant" can be varied.

In the example of FIG. 2, the data items with "data number" 1, 2, and 3, whose "data attribute" is "printed type" and which have attribute values for "thickness", "slant", and so on, are reference data. The data items with "data number" 4 and 5, whose "data attribute" is "handwritten character" and which have no such attribute values, are evaluation data. More generally, the reference data includes printed type and handwritten characters, and the evaluation data includes handwritten characters.

Next, the reference data and evaluation data for general data will be explained. FIG. 3 is a diagram showing reference data and evaluation data (in the case of general data). In FIG. 2, reference data and evaluation data were separated by data attribute. FIG. 3 generalizes this: for each data item and each attribute, a data item that stores an attribute value for the attribute is reference data for that attribute, and a data item that does not store an attribute value for the attribute is evaluation data for that attribute.

For "attribute 1", the reference data are the data items with "data number" 1, 2, 3, and 4, and the evaluation data is the data item with "data number" 5. Similarly, for "attribute 2", the reference data are the data items with "data number" 1, 2, and 5, and the evaluation data are the data items with "data number" 3 and 4.
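The per-attribute split described for FIG. 3 can be sketched as follows. The record layout, the attribute names, and the numeric attribute values are placeholders assumed for illustration; only the pattern of which data numbers have a value for which attribute follows the description above.

```python
from typing import Dict, List, Optional, Tuple

# Attribute values per data number; None marks an attribute value that is not stored.
# The numbers themselves are placeholders, not values from FIG. 3.
records: Dict[int, Dict[str, Optional[float]]] = {
    1: {"attribute1": 0.1, "attribute2": 1.0},
    2: {"attribute1": 0.2, "attribute2": 2.0},
    3: {"attribute1": 0.3, "attribute2": None},
    4: {"attribute1": 0.4, "attribute2": None},
    5: {"attribute1": None, "attribute2": 3.0},
}

def split_by_attribute(
    data: Dict[int, Dict[str, Optional[float]]], attribute: str
) -> Tuple[List[int], List[int]]:
    """Return (reference data numbers, evaluation data numbers) for one attribute."""
    reference = sorted(n for n, row in data.items() if row.get(attribute) is not None)
    evaluation = sorted(n for n, row in data.items() if row.get(attribute) is None)
    return reference, evaluation

print(split_by_attribute(records, "attribute1"))  # ([1, 2, 3, 4], [5])
print(split_by_attribute(records, "attribute2"))  # ([1, 2, 5], [3, 4])
```

The same data item can therefore be reference data for one attribute and evaluation data for another.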

The purpose of the following embodiments is to estimate, for each attribute, the attribute values of the evaluation data by using the attribute values of the reference data.

[Embodiment 1]
(Operations during model learning and data reconstruction)
FIG. 4 is a diagram for explaining the operation of the information processing system 1 according to Embodiment 1 during model learning. In this embodiment, reference data 201 in which attribute values have been assigned to attributes in advance is used, and model learning is executed with a new loss function for variational autoencoder-related technology so that the attribute values can be predicted by a multiple regression model from the latent variables of the reference data 201.

The information processing system 1 has an MFCVAE 2. The MFCVAE 2 includes an encoder 203 and a decoder 205.

During model learning of the MFCVAE 2, the data input to the encoder 203 (the training data set) includes either the reference data 201 only or both the reference data 201 and the evaluation data 202. The encoder 203 outputs latent variables 204, which are an intermediate output of the MFCVAE 2. The MFCVAE 2 sets a multiple regression model in which the latent variables 204 are the explanatory variables and the correct value (correct label) 208 of the attribute value is the objective variable. The regression model is not limited to a multiple regression model; either a linear regression model or a nonlinear regression model may be used. This embodiment adopts a multiple regression model because it has the advantage of a low computational load.

From the latent variables 204 and the correct values 208 of the attribute values, the information processing system 1 obtains the partial regression coefficients 209 that minimize the mean squared error between the predicted and correct attribute values, together with the predicted values 210 of the attribute values at that minimum. The predicted value 210 of an attribute value is calculated as a linear combination of the latent variables 204 with the partial regression coefficients 209 as the coefficients. Based on the correct values 208 and the predicted values 210 of the attribute values, a coefficient of determination 211 and a prediction error 212 are obtained; these can serve as indices of goodness of fit that take a smaller value the better the fit to the multiple regression model. The coefficient of determination 211 and the prediction error 212 are used for learning through a loss function 213 that includes an index of the goodness of fit to the multiple regression model. The encoder 203 is trained in this way.
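The fitting step described above can be sketched with ordinary least squares. This is an illustrative NumPy sketch rather than the embodiment's implementation: the function name `fit_attribute_regression`, the use of an intercept column, and the toy data are assumptions, and the coefficient of determination is computed with the usual textbook definition.

```python
import numpy as np

def fit_attribute_regression(z, y):
    """Least-squares fit of attribute values y (M,) on latent variables z (M, K).

    Returns the regression coefficients (intercept first), the predicted
    attribute values, and the ordinary coefficient of determination R^2.
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.hstack([np.ones((z.shape[0], 1)), z])  # assumed intercept column
    w, *_ = np.linalg.lstsq(design, y, rcond=None)     # minimizes the squared error
    y_hat = design @ w                                 # linear combination of the latents
    ss_res = float(np.sum((y - y_hat) ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return w, y_hat, 1.0 - ss_res / ss_tot

# Toy reference data: attribute value = 1 + 2*z1 + 3*z2 (no noise).
z_ref = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_ref = 1.0 + 2.0 * z_ref[:, 0] + 3.0 * z_ref[:, 1]
w, y_hat, r2 = fit_attribute_regression(z_ref, y_ref)
print(np.round(w, 6), round(r2, 6))
```

In an actual training loop the fit would have to be expressed in a differentiable framework so that the index can be backpropagated to the encoder; the NumPy version only shows the arithmetic.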

The latent variables 204 are input to the decoder 205. The decoder 205 outputs reconstructed reference data 206, which is the reference data 201 reconstructed by the decoder 205, and reconstructed evaluation data 207, which is the evaluation data 202 reconstructed by the decoder 205. The reconstructed evaluation data 207 is data to which attribute values have been assigned.

The partial regression coefficients 209, the loss function 213, the coefficient of determination 211, and the prediction error 212 are explained below.

In the conventional MFCVAE, the objective function, the variational lower bound (Evidence Lower Bound, ELBO), is expressed as in equation (1). In the conventional MFCVAE, the parameters of the MFCVAE model are learned so as to minimize a loss function equal to the negative of the variational lower bound in equation (1) (see Document 2 above). In equation (1), D is the training data set, x is a training data item included in the training data set, \vec{z} is the latent variable, θ is the parameter of the encoder, φ is the parameter of the decoder, and KL(A|B) is the KL divergence between distribution A and distribution B.

In this embodiment, by contrast, the objective function is set as in equation (2), where R_{f,j}^2 is the degree-of-freedom adjusted coefficient of determination for facet j and γ_j (> 0) is the weight of each adjusted coefficient of determination R_{f,j}^2. The right side of equation (2) is the right side of equation (1) with a third term γ_j R_{f,j}^2 added to the expression inside the brackets of the expectation E[*].

In the third term of equation (2), the coefficient of determination R_j^2 described later may be used instead of the degree-of-freedom adjusted coefficient of determination R_{f,j}^2. The coefficient of determination R_j^2 and the degree-of-freedom adjusted coefficient of determination R_{f,j}^2 are examples of the coefficient of determination 211.

Here, the first term inside the brackets of the expectation E[*] in equation (2) is a reconstruction error term representing the error of data reconstruction by the MFCVAE. The second term inside the brackets is a regularization term that constrains the distribution of the latent variables of the MFCVAE, for example by suppressing their variation. The third term inside the brackets is an additional term based on an index that, for each attribute, takes a smaller value the better the latent variables and attribute values fit the multiple regression model.

損失関数２１３（損失関数Ｌｏｓｓと表す）は負の目的関数であるから、式（２）の目的関数を用いて式（３）のように表される。 Since the loss function 213 (denoted as the loss function Loss) is the negative of the objective function, it is expressed as in equation (3) using the objective function of equation (2).

本実施形態では、損失関数Ｌｏｓｓが最小化、すなわち決定係数Ｒ_ｆ，ｊ ^２が最大化されるように、ＭＦＣＶＡＥモデルのパラメータが学習される。 In this embodiment, the parameters of the MFCVAE model are learned so that the loss function Loss is minimized, that is, so that the coefficient of determination R_{f,j}² is maximized.
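The relationship between the three terms and the loss can be sketched as follows. This is a minimal illustration, not the patented implementation: equations (2) and (3) are not reproduced in this text, so the sketch assumes that the reconstruction and regularization terms are already given as quantities to be minimized and that the additional term is summed over the attributes j. The function name and arguments are hypothetical.

```python
def mfcvae_loss(recon_error, kl_term, r2_by_attr, gamma_by_attr):
    """Loss = reconstruction error + regularizer - sum_j gamma_j * R2_j.

    recon_error, kl_term: assumed to be values that training minimizes.
    r2_by_attr:    (adjusted) coefficient of determination per attribute j.
    gamma_by_attr: positive weight gamma_j per attribute j.
    """
    if len(r2_by_attr) != len(gamma_by_attr):
        raise ValueError("one weight per attribute is required")
    if any(g <= 0 for g in gamma_by_attr):
        raise ValueError("weights gamma_j must be positive")
    # A better regression fit (larger R2) lowers the loss.
    fit_bonus = sum(g * r2 for g, r2 in zip(gamma_by_attr, r2_by_attr))
    return recon_error + kl_term - fit_bonus
```

With the other terms held fixed, raising any R² lowers the returned loss, which is why minimizing the loss pushes the latent variables toward a linear relationship with the attribute values.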

次に、観点ｊの決定係数Ｒ_ｊ ^２、及び観点ｊの自由度調整済み決定係数Ｒ_ｆ，j ^２の算出方法を説明する。 Next, a method of calculating the coefficient of determination R _j ² of viewpoint j and the degree of freedom adjusted coefficient of determination R _f,j ² of viewpoint j will be described.

データ数がＮ個、データの属性の種類がｊ＝１，２，…，ＪのＪ個、ある属性ｊにおける潜在変数の次元数をＫ_ｊとする。またデータの各属性がＭＦＣＶＡＥのＪ個の各観点と一対一に対応しているものとする。インデックス番号ｎであるデータｎに属性ｊの属性値が付与されていれば、データｎは属性ｊにおいて基準データである。一方データｎに属性ｊの属性値が付与されていなければ、データｎは属性ｊにおいて評価データである。 Let N be the number of data items, let there be J types of data attribute indexed by j = 1, 2, ..., J, and let K_j be the number of dimensions of the latent variable for an attribute j. Assume also that each attribute of the data corresponds one-to-one to one of the J viewpoints of the MFCVAE. If data n, the data item with index number n, has been assigned an attribute value for attribute j, data n is reference data for attribute j. If data n has not been assigned an attribute value for attribute j, data n is evaluation data for attribute j.

ある属性ｊに関する基準データのインデックスの集合をＢ_ｊ、集合Ｂ_ｊの要素数をＭ_ｊとする。集合Ｂ_ｊ＝｛ｂ_ｊ，１，ｂ_ｊ，２，…，ｂ_ｊ，Ｍｊ}とする。ある属性ｊに関する基準データｎにおける潜在変数をｚ_ｎ，ｊ＝｛ｚ_{ｎ，ｊ，１}，ｚ_{ｎ，ｊ，２}，…，ｚ_{ｎ，ｊ，Ｋｊ}}、属性値の正解値をｙ_ｎ，ｊ、潜在変数ｚ_ｎ，ｊを説明変数とする。また、属性値を目的変数とした重回帰モデルの偏回帰係数をｗ_ｊ＝｛ｗ_ｊ，０，ｗ_ｊ，１，ｗ_ｊ，２，…，ｗ_ｊ，Ｋｊ}^Ｔ、属性値の予測値をｙ^_ｎ，ｊとする。属性値の予測値ｙ^_ｎ，ｊは、式（４）のように表される。 Let B_j be the set of indices of the reference data for an attribute j, and let M_j be the number of elements of B_j, so that B_j = {b_{j,1}, b_{j,2}, ..., b_{j,M_j}}. For reference data n of attribute j, let the latent variable be z_{n,j} = {z_{n,j,1}, z_{n,j,2}, ..., z_{n,j,K_j}} and the correct attribute value be y_{n,j}, with the latent variable z_{n,j} serving as the explanatory variable. Let the partial regression coefficients of the multiple regression model with the attribute value as the objective variable be w_j = {w_{j,0}, w_{j,1}, w_{j,2}, ..., w_{j,K_j}}^T, and let the predicted attribute value be ŷ_{n,j}. The predicted value ŷ_{n,j} is expressed as in equation (4).

ただし、式（５）のように潜在変数ベクトルＺ_ｎ，ｊを定義した。

Here, the latent variable vector Z_{n,j} is defined as in equation (5).

ある属性ｊに関する属性値ｙ_ｎ，ｊ（ただしｎ∈Ｂ_ｊ）と重回帰モデルによる属性値の予測値ｙ^_ｎ，ｊとの予測誤差である平均二乗誤差ＭＳＥ_ｊは、式（６）のように表される。平均二乗誤差ＭＳＥｊは、予測誤差２１２の一例である。

The mean squared error MSE_j, which is the prediction error between the attribute value y_{n,j} (where n ∈ B_j) of an attribute j and the value ŷ_{n,j} predicted by the multiple regression model, is expressed as in equation (6). The mean squared error MSE_j is an example of the prediction error 212.

ここで平均二乗誤差ＭＳＥ_ｊを最小化するｗ_ｊは、式（６）の右辺をｗ_ｊで偏微分してゼロとおく（∇ｗ_ｊ＝０）ことで、式（７）のように、偏回帰係数ｗ_ｊは、潜在変数Ｚ_ｊと属性値ｙ_ｊの関数となる。偏回帰係数ｗ_ｊは、偏回帰係数２０９の一例である。

Here, the w_j that minimizes the mean squared error MSE_j is obtained by partially differentiating the right side of equation (6) with respect to w_j and setting the result to zero (∇w_j = 0). As shown in equation (7), the partial regression coefficient w_j is then a function of the latent variable Z_j and the attribute value y_j. The partial regression coefficient w_j is an example of the partial regression coefficient 209.

ただし式（７）において、潜在変数Ｚ_ｊと属性値ｙ_ｊを、式（８）と式（９）に示すようにおいた。

In equation (7), the latent variable Z_j and the attribute value y_j are defined as shown in equations (8) and (9).
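The equation images for (4) to (9) are not reproduced in this text. The block below is a reconstruction from the surrounding definitions, assuming the standard ordinary least squares form; the layout and the matrix shape of Z_j are assumptions and may differ from the original drawings.

```latex
\begin{align}
\hat{y}_{n,j} &= w_j^{\mathsf{T}} Z_{n,j} \tag{4} \\
Z_{n,j} &= \left(1,\ z_{n,j,1},\ z_{n,j,2},\ \dots,\ z_{n,j,K_j}\right)^{\mathsf{T}} \tag{5} \\
\mathrm{MSE}_j &= \frac{1}{M_j} \sum_{n \in B_j} \left(y_{n,j} - \hat{y}_{n,j}\right)^2 \tag{6} \\
w_j &= \left(Z_j^{\mathsf{T}} Z_j\right)^{-1} Z_j^{\mathsf{T}} y_j \tag{7} \\
Z_j &= \left(Z_{b_{j,1},j},\ Z_{b_{j,2},j},\ \dots,\ Z_{b_{j,M_j},j}\right)^{\mathsf{T}} \tag{8} \\
y_j &= \left(y_{b_{j,1},j},\ y_{b_{j,2},j},\ \dots,\ y_{b_{j,M_j},j}\right)^{\mathsf{T}} \tag{9}
\end{align}
```

Under this reading, the leading 1 in Z_{n,j} pairs with the intercept w_{j,0}, and Z_j is an M_j × (K_j + 1) matrix whose rows are the reference data of attribute j.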

なお、ある属性ｊに関する属性値ｙ_ｎ，ｊ（ただしｎ∈Ｂ_ｊ）と重回帰モデルによる属性値の予測値ｙ^_ｎ，ｊとの予測誤差は、平均二乗誤差に限らず、平均誤差、平均絶対誤差、平均平方二乗誤差、平均誤差率、平均絶対誤差率等を採用することもできる。 Note that the prediction error between the attribute value y_{n,j} (where n ∈ B_j) of an attribute j and the value ŷ_{n,j} predicted by the multiple regression model is not limited to the mean squared error; the mean error, mean absolute error, root mean squared error, mean error rate, mean absolute error rate, or the like may also be employed.

ある属性ｊに関する決定係数Ｒ_ｊ ^２は、説明変数が目的変数をどれくらい説明しているかを表す。決定係数Ｒ_ｊ ^２は、属性値の平均値ｙ￣_ｎ，ｊを用いて式（１０）のように表される。

The coefficient of determination R_j² for an attribute j represents how well the explanatory variables explain the objective variable. It is expressed as in equation (10) using the mean value ȳ_{n,j} of the attribute values.

また決定係数は、説明変数の数が増えるほど１に近づくという性質を持っているため、説明変数の数が多い場合には、この性質を補正した自由度調整済み決定係数Ｒ_ｆ，ｊ ^２が採用されてもよい。自由度調整済み決定係数Ｒ_ｆ，ｊ ^２は、説明変数の数をｐとし、基準データのサンプル数はＭ_ｊであるので、式（１１）のように表される。

In addition, the coefficient of determination approaches 1 as the number of explanatory variables increases. When there are many explanatory variables, the degree-of-freedom adjusted coefficient of determination R_{f,j}², which corrects for this property, may therefore be adopted. With p explanatory variables and M_j reference data samples, the degree-of-freedom adjusted coefficient of determination R_{f,j}² is expressed as in equation (11).
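The regression quantities described above can be computed as in the following sketch for one attribute j. It is a plain-Python illustration under stated assumptions: the coefficients come from the normal equations of ordinary least squares, and the adjusted coefficient uses the textbook form 1 - (1 - R²)(M_j - 1)/(M_j - p - 1), since equations (10) and (11) are not reproduced in this text. The helper names are hypothetical, and the sketch needs M_j > p + 1 and non-constant attribute values.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_attribute_regression(latents, targets):
    """latents: M_j rows of K_j latent values; targets: M_j attribute values.

    Returns (w, mse, r2, adjusted_r2) for one attribute.
    """
    rows = [[1.0] + list(z) for z in latents]  # prepend 1 for the intercept
    m_j, dim = len(rows), len(rows[0])
    p = dim - 1                                # number of explanatory variables
    ztz = [[sum(r[a] * r[b] for r in rows) for b in range(dim)] for a in range(dim)]
    zty = [sum(r[a] * y for r, y in zip(rows, targets)) for a in range(dim)]
    w = solve_linear(ztz, zty)                 # least squares coefficients
    preds = [sum(wk * zk for wk, zk in zip(w, r)) for r in rows]
    sse = sum((y - yh) ** 2 for y, yh in zip(targets, preds))
    mean_y = sum(targets) / m_j
    sst = sum((y - mean_y) ** 2 for y in targets)
    r2 = 1.0 - sse / sst
    adjusted_r2 = 1.0 - (1.0 - r2) * (m_j - 1) / (m_j - p - 1)
    return w, sse / m_j, r2, adjusted_r2
```

In an actual training loop these operations would have to be expressed in a differentiable framework so that the fit term can be backpropagated; the sketch only shows the arithmetic.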

なお、重み係数γ_ｊは、再構成誤差項、正則化項、及び決定係数Ｒ_ｊ ^２（Ｒ_ｆ，ｊ ^２）の絶対値の比較から求めることができる。具体的には、決定係数Ｒ_ｊ ^２に対する重み係数γ_ｊは、|γ_ｊＲ_ｊ ^２|のオーダーが式（２）の右辺の期待値Ｅ［*］のカッコ内の式の再構成誤差項と正則化項の各絶対値のオーダーと同じになるように定められる。同様に、自由度調整済み決定係数Ｒ_ｆ，ｊ ^２に対する重み係数γ_ｊも、|γ_ｊＲ_ｆ，ｊ ^２|のオーダーが、式（２）の右辺の期待値Ｅ［*］のカッコ内の式の再構成誤差項と正則化項の各絶対値のオーダーと同じになるように定められる。 Note that the weighting coefficient γ_j can be determined by comparing the absolute values of the reconstruction error term, the regularization term, and the coefficient of determination R_j² (R_{f,j}²). Specifically, the weighting coefficient γ_j for the coefficient of determination R_j² is set so that the order of magnitude of |γ_j R_j²| equals the order of magnitude of the absolute values of the reconstruction error term and the regularization term in the expression inside the brackets of the expected value E[*] on the right side of equation (2). Similarly, the weighting coefficient γ_j for the degree-of-freedom adjusted coefficient of determination R_{f,j}² is set so that the order of magnitude of |γ_j R_{f,j}²| equals the order of magnitude of the absolute values of the reconstruction error term and the regularization term in the expression inside the brackets of the expected value E[*] on the right side of equation (2).
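One way to read this order-of-magnitude rule is sketched below. The source does not say what to do when the reconstruction error term and the regularization term have different orders of magnitude; matching the larger of the two is an assumption of this sketch, as are the function names.

```python
import math


def order_of_magnitude(value):
    """Base-10 exponent of |value|, e.g. 350 -> 2 and 0.7 -> -1."""
    if value == 0:
        raise ValueError("order of magnitude is undefined for zero")
    return math.floor(math.log10(abs(value)))


def pick_gamma(recon_term, reg_term, r2):
    """Choose gamma_j so that |gamma_j * R2| has the same order as the other terms."""
    # Assumption: when the two terms differ, match the larger one.
    target = max(order_of_magnitude(recon_term), order_of_magnitude(reg_term))
    return 10.0 ** (target - order_of_magnitude(r2))
```

For example, with a reconstruction term of about -350, a regularization term of about 42, and R² = 0.7, the sketch returns γ_j = 1000, so that |γ_j R²| = 700 has the same order as the reconstruction term.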

基準データ２０１及び評価データ２０２の再構成時には、デコーダ２０５は、基準データ２０１及び評価データ２０２の属性及び属性値（評価データ２０２の場合は付与された属性値）と、エンコーダ２０３の学習の最後のエポックで得た偏回帰係数２０９を用いる。そして、基準データ２０１及び評価データ２０２の属性及び属性値と、偏回帰係数２０９とを用いて、式（４）から、潜在変数（潜在変数ベクトルＺ_ｎ，ｊ）を算出する。そして、デコーダ２０５は、算出した潜在変数を入力として、入力された基準データ２０１及び評価データ２０２をそれぞれ再構成した再構成基準データ２０６及び再構成評価データ２０７を出力する。 When reconstructing the reference data 201 and the evaluation data 202, the decoder 205 uses the attributes and attribute values of the reference data 201 and the evaluation data 202 (for the evaluation data 202, the assigned attribute values) together with the partial regression coefficients 209 obtained in the last epoch of training of the encoder 203. Using these attributes and attribute values and the partial regression coefficients 209, the latent variable (latent variable vector Z_{n,j}) is calculated from equation (4). The decoder 205 then takes the calculated latent variables as input and outputs reconstructed reference data 206 and reconstructed evaluation data 207, which are reconstructions of the input reference data 201 and evaluation data 202, respectively.

（属性値付与時の動作）
図５は、実施形態１に係る情報処理システム１の評価データ２０２に対する属性値付与時の動作を説明するための図である。情報処理システム１は、評価データ２０２への属性値付与時には、先ず評価データ２０２を学習済みのエンコーダ２０３に入力し、潜在変数２０４を得る。情報処理システム１は、ＭＦＣＶＡＥ２（図４）の学習の最終エポックで得た偏回帰係数２０９を用い、属性値の予測値２１０を、偏回帰係数２０９を各係数とする潜在変数２０４の一次結合式で算出する。情報処理システム１は、属性値の予測値２１０を評価データ２０２に付与する。 (Behavior when assigning attribute value)
FIG. 5 is a diagram for explaining the operation of the information processing system 1 according to the first embodiment when assigning attribute values to the evaluation data 202. When assigning attribute values to the evaluation data 202, the information processing system 1 first inputs the evaluation data 202 into the trained encoder 203 to obtain the latent variables 204. Using the partial regression coefficients 209 obtained in the final epoch of training of MFCVAE2 (FIG. 4), the information processing system 1 calculates the predicted value 210 of the attribute value as a linear combination of the latent variables 204 whose coefficients are the partial regression coefficients 209. The information processing system 1 then assigns the predicted value 210 of the attribute value to the evaluation data 202.
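The linear combination used for attribute value assignment can be sketched as follows. The function name is hypothetical, and the coefficient list is assumed to start with the intercept w_0, matching the definition of w_j given earlier.

```python
def predict_attribute(latent, coeffs):
    """Predicted attribute value: w_0 + sum_k w_k * z_k.

    latent: [z_1, ..., z_K] from the trained encoder.
    coeffs: [w_0, w_1, ..., w_K] from the final training epoch.
    """
    if len(coeffs) != len(latent) + 1:
        raise ValueError("expected one intercept plus one coefficient per latent dimension")
    return coeffs[0] + sum(w * z for w, z in zip(coeffs[1:], latent))
```

No further training is needed at this stage; the encoder and the stored coefficients are enough to label each evaluation datum.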

（データ生成時の動作）
図６は、実施形態１に係る情報処理システム１の属性値を指定したデータ生成時の動作を説明するための図である。データ生成とは、変分オートエンコーダ関連技術において、潜在変数を入力として、デコーダからデータを出力するこという。情報処理システム１は、生成させたい属性値４０１を持つデータ４０５の生成時には、ユーザが生成させたい属性及び属性値４０１と、ＭＦＣＶＡＥ２（図４）の学習の最後で得た偏回帰係数２０９とから潜在変数２０４を算出する。そして情報処理システム１は、算出した潜在変数２０４をデコーダ２０５に入力することで、生成させたい属性値４０１を持つデータ４０５を生成する。 (Operation during data generation)
FIG. 6 is a diagram for explaining the operation of the information processing system 1 according to the first embodiment when generating data with specified attribute values. Data generation, in variational autoencoder related technology, refers to outputting data from a decoder using latent variables as input. When generating data 405 having an attribute value 401 that the user wants to generate, the information processing system 1 uses the attribute and attribute value 401 that the user wants to generate and the partial regression coefficient 209 obtained at the end of learning of MFCVAE2 (FIG. 4). A latent variable 204 is calculated. Then, the information processing system 1 inputs the calculated latent variable 204 to the decoder 205 to generate data 405 having the attribute value 401 that is desired to be generated.

なお、データ生成の際に、指定された属性及び属性値４０１に該当する基準データ２０１が存在する場合には、この基準データ２０１に対応するデータを再構成したデータ４０５として採用する。指定された属性及び属性値４０１に該当する基準データ２０１が存在しない場合に、生成させたい属性及び属性値４０１と、偏回帰係数２０９とから潜在変数２０４を算出する。そして、算出した潜在変数２０４をデコーダ２０５に入力することで、生成させたい属性値４０１を持つデータ４０５を生成する。 Note that when the data is generated, if there is reference data 201 that corresponds to the specified attribute and attribute value 401, the data corresponding to this reference data 201 is adopted as the reconstructed data 405. If the reference data 201 corresponding to the specified attribute and attribute value 401 does not exist, a latent variable 204 is calculated from the attribute and attribute value 401 to be generated and the partial regression coefficient 209. Then, by inputting the calculated latent variable 204 to the decoder 205, data 405 having the attribute value 401 to be generated is generated.
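The step of computing a latent variable from a desired attribute value can be sketched as below. The source says only that the latent variable is obtained from the attribute value and the partial regression coefficients; when the latent variable has more than one dimension, the regression equation has infinitely many solutions, so the minimum-norm choice used here is an assumption of this sketch, not the method stated in the source. The function name is hypothetical.

```python
def latent_for_attribute(value, coeffs):
    """Return a latent vector z with w_0 + sum_k w_k * z_k == value.

    coeffs: [w_0, w_1, ..., w_K]. Among all solutions, the one with the
    smallest Euclidean norm is returned (assumption of this sketch).
    """
    intercept, slopes = coeffs[0], coeffs[1:]
    norm_sq = sum(w * w for w in slopes)
    if norm_sq == 0:
        raise ValueError("all slope coefficients are zero; the value cannot be reached")
    scale = (value - intercept) / norm_sq
    return [scale * w for w in slopes]
```

The returned vector would then be passed to the decoder. A practical system might instead pick the solution closest to the latent variables of existing reference data.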

（モデル学習時の決定係数の推移）
図７は、実施形態１に係る情報処理システム１のモデル学習時の決定係数の推移を説明するための図である。図７のグラフでは、潜在変数を横軸、属性値を縦軸に取り、属性値の実際の値を点で表し、属性値の予測値を直線で表している。情報処理システム１は、ＭＦＣＶＡＥ２の損失関数に、潜在変数を説明変数、属性値を目的関数とする重回帰モデルの決定係数を含む追加項を追加し、決定係数が高くなるようにＭＦＣＶＡＥ２を学習させる。その結果、学習の初期では決定係数は低い（図７（ａ））が、学習のエポック数が進行して学習の中期（図７（ｂ））、後期（図７（ｃ））と推移するに従って、決定係数は高くなり、潜在変数に基づく属性値の予測精度が高くなる。 (Transition of coefficient of determination during model learning)
FIG. 7 is a diagram for explaining the transition of the coefficient of determination during model training of the information processing system 1 according to the first embodiment. In the graphs of FIG. 7, the latent variable is plotted on the horizontal axis and the attribute value on the vertical axis; actual attribute values are shown as points and predicted attribute values as a straight line. The information processing system 1 adds to the loss function of MFCVAE2 an additional term that includes the coefficient of determination of a multiple regression model in which the latent variables are the explanatory variables and the attribute value is the objective variable, and trains MFCVAE2 so that the coefficient of determination becomes high. As a result, the coefficient of determination is low in the early stage of training (FIG. 7(a)), but as the number of epochs advances through the middle stage (FIG. 7(b)) and the late stage (FIG. 7(c)), the coefficient of determination rises and the accuracy of predicting attribute values from the latent variables improves.

（実施形態１に係る情報処理システム１の構成）
図８は、実施形態１に係る情報処理システム１の構成を示すブロック図である。情報処理システム１は、データ記憶部６０２、特徴量抽出部６０３、属性値付与部６０８、データ生成部６１４、及びデータ品質評価部６１２を有する。 (Configuration of information processing system 1 according to Embodiment 1)
FIG. 8 is a block diagram showing the configuration of the information processing system 1 according to the first embodiment. The information processing system 1 includes a data storage unit 602, a feature extraction unit 603, an attribute value assigning unit 608, a data generation unit 614, and a data quality evaluation unit 612.

データ記憶部６０２は、メモリ又はストレージであり、基準データ２０１と評価データ２０２の入力を受け付け、蓄積する。データ記憶部６０２は、情報処理システム１に含まれる装置であっても、情報処理システム１の外部装置であっても何れでもよい。 The data storage unit 602 is a memory or storage, and receives input of the reference data 201 and evaluation data 202 and stores them. The data storage unit 602 may be a device included in the information processing system 1 or an external device to the information processing system 1.

特徴量抽出部６０３は、ＭＦＣＶＡＥ２のデータ記憶部６０２に格納されている基準データ２０１を元に、ＭＦＣＶＡＥ２のモデル学習を実行する。また特徴量抽出部６０３は、ＭＦＣＶＡＥ２のデータ記憶部６０２に格納されている評価データ２０２の属性推定を行う。また特徴量抽出部６０３は、属性値を指定したデータ生成を行う。特徴量抽出部６０３は、回帰モデル適合度評価部６０４、損失算出部６０５、モデル更新部６０６、及びエンコーダ部６０７を有する。特徴量抽出部６０３の処理機能は、図９を参照して後述する。 The feature extraction unit 603 executes model training of the MFCVAE2 based on the reference data 201 stored in the data storage unit 602 of the MFCVAE2. The feature extraction unit 603 also performs attribute estimation on the evaluation data 202 stored in the data storage unit 602 of the MFCVAE2, and generates data with specified attribute values. The feature extraction unit 603 includes a regression model fitness evaluation unit 604, a loss calculation unit 605, a model update unit 606, and an encoder unit 607. The processing functions of the feature extraction unit 603 will be described later with reference to FIG. 9.

属性値付与部６０８は、評価データ２０２の属性推定を行い、評価データ２０２の属性及び属性値６１１を出力する。属性値付与部６０８は、属性値推定部６０９と、属性及び属性値出力部６１０とを有する。属性値付与部６０８の処理機能は、図１０を参照して後述する。 The attribute value assigning unit 608 estimates the attributes of the evaluation data 202 and outputs the attributes and attribute values 611 of the evaluation data 202. The attribute value assigning unit 608 includes an attribute value estimating unit 609 and an attribute and attribute value output unit 610. The processing functions of the attribute value assigning unit 608 will be described later with reference to FIG. 10.

データ生成部６１４は、属性値付与部６０８によって出力された対象データの属性及び属性値６１１を入力として、属性値を指定したデータ生成を行い、生成したデータ４０５を出力する。データ生成部６１４は、潜在変数算出部６１５、デコード部６１６、及びデータ出力部６１７を有する。データ生成部６１４の処理機能は、図１１を参照して後述する。 The data generation unit 614 inputs the attribute and attribute value 611 of the target data output by the attribute value assignment unit 608, generates data specifying the attribute value, and outputs the generated data 405. The data generation section 614 includes a latent variable calculation section 615, a decoding section 616, and a data output section 617. The processing functions of the data generation unit 614 will be described later with reference to FIG. 11.

データ品質評価部６１２は、属性値付与部６０８によって出力された対象データ（基準データ２０１、評価データ２０２）の属性及び属性値６１１に基づいて対象データの品質評価を行い、データ品質評価結果６１３を出力する。 The data quality evaluation unit 612 evaluates the quality of the target data (reference data 201 and evaluation data 202) based on the attributes and attribute values 611 of the target data output by the attribute value assigning unit 608, and outputs the data quality evaluation result 613.

データ品質評価部６１２は、評価の対象データの属性及び属性値６１１を用いて、一例として、下記のような観点で対象データの品質評価を行う。品質評価については“機械学習品質マネジメントガイドライン”、国立研究開発法人産業技術総合研究所、［令和４年８月１日検索］、インターネット＜URL：https://www.aist.go.jp/aist_j/press_release/pr2020/pr20200630_2/pr20200630_2.html＞を参照すればよい。
（１）データ設計の十分性：データを用いる対象のシステムが対応すべき様々な状況に対して十分な訓練データやテストデータを確保していること。
（２）データセットの被覆性：基準を定めて網羅したそれぞれのケースに対してそれぞれのケースに対応する入力の可能性に対して抜け漏れなく、レアケース及び通常ケースそれぞれに正しく推論できる学習に必要な十分な量のデータが与えられていること。
（３）データの均一性：全体として推論性能の期待値を最大化するように、訓練データを偏り無く用意すること。 The data quality evaluation unit 612 uses the attributes and attribute values 611 of the evaluation target data to evaluate the quality of the target data from, for example, the following viewpoints. For quality evaluation, see "Machine Learning Quality Management Guidelines," National Institute of Advanced Industrial Science and Technology, [retrieved August 1, 2022], Internet <URL: https://www.aist.go.jp/aist_j/press_release/pr2020/pr20200630_2/pr20200630_2.html>.
(1) Sufficiency of data design: Sufficient training data and test data are secured for the various situations that the target system using the data must handle.
(2) Coverage of the dataset: For each case enumerated under a defined criterion, a sufficient amount of data is provided, without gaps across the possible inputs for that case, to learn to infer correctly for both rare cases and normal cases.
(3) Uniformity of the data: Training data is prepared without bias so as to maximize the expected value of inference performance as a whole.

なお、データ品質評価部６１２の処理機能は、図１２を参照して後述する。 Note that the processing functions of the data quality evaluation unit 612 will be described later with reference to FIG. 12.

また、特徴量抽出部６０３、属性値付与部６０８、データ生成部６１４、及びデータ品質評価部６１２は、１つのコンピュータ上に実現されていてもよいし、異なるコンピュータ上に実現されてもよく、これらの統合分散の形態は適宜変更可能である。 Further, the feature amount extraction unit 603, the attribute value assignment unit 608, the data generation unit 614, and the data quality evaluation unit 612 may be realized on one computer, or may be realized on different computers. These forms of integration and distribution can be changed as appropriate.

（実施形態１に係る特徴量抽出処理）
図９は、実施形態１に係る特徴量抽出処理を示すフローチャートである。特徴量抽出処理は、特徴量抽出部６０３（図８）によって、ユーザ指示を契機として実行される。 (Feature quantity extraction processing according to Embodiment 1)
FIG. 9 is a flowchart showing feature extraction processing according to the first embodiment. The feature extraction process is executed by the feature extraction unit 603 (FIG. 8) in response to a user instruction.

先ずステップＳ１１では、回帰モデル適合度評価部６０４は、基準データ２０１の潜在変数を説明変数とし、属性値を目的変数としたＭＦＣＶＡＥモデル（本実施形態では重回帰モデル）の当てはまりの良さを示す指標を損失関数Ｌｏｓｓに設定する。損失関数Ｌｏｓｓに設定される指標は、本実施形態では、観点ｊでの自由度調整済み決定係数Ｒ_ｆ，ｊ ^２である。 First, in step S11, the regression model fitness evaluation unit 604 sets, in the loss function Loss, an index indicating the goodness of fit of the MFCVAE model (a multiple regression model in this embodiment) that uses the latent variables of the reference data 201 as explanatory variables and the attribute values as objective variables. In this embodiment, the index set in the loss function Loss is the degree-of-freedom adjusted coefficient of determination R_{f,j}² for viewpoint j.

次にステップＳ１２では、回帰モデル適合度評価部６０４は、ＭＦＣＶＡＥモデルの初期化を行う。次にステップＳ１３では、回帰モデル適合度評価部６０４は、基準データ２０１及び評価データ２０２をＭＦＣＶＡＥモデルに入力する。ステップＳ１３では、回帰モデル適合度評価部６０４は、少なくとも基準データ２０１をＭＦＣＶＡＥモデルに入力すればよい。 Next, in step S12, the regression model fitness evaluation unit 604 initializes the MFCVAE model. Next, in step S13, the regression model fitness evaluation unit 604 inputs the reference data 201 and the evaluation data 202 to the MFCVAE model. In step S13, the regression model fitness evaluation unit 604 may input at least the reference data 201 to the MFCVAE model.

次にステップＳ１４では、損失算出部６０５は、式（３）に基づいて損失関数Ｌｏｓｓの関数値を算出する。回帰モデル適合度評価部６０４は、損失算出部６０５による損失関数Ｌｏｓｓの関数値の算出の前段階として、次の処理を行う。すなわち、回帰モデル適合度評価部６０４は、ステップＳ１３の入力データの入力に対してＭＦＣＶＡＥモデルから出力された潜在変数を説明変数とし、属性値を目的変数とする重回帰モデルを属性毎に設定する。次に、回帰モデル適合度評価部６０４は、潜在変数と属性値とから、属性値に対する予測誤差が最小となる属性値の予測値及び重回帰モデルの回帰係数を属性毎に算出する。次に、回帰モデル適合度評価部６０４は、算出された属性毎の予測値及び回帰係数に基づいて、潜在変数及び属性値に対する重回帰モデルへの適合が良いほど小さい値を取る指標を属性毎に算出する。その後、損失算出部６０５は、ステップＳ１４で、損失関数Ｌｏｓｓの関数値を算出する。 Next, in step S14, the loss calculation unit 605 calculates the function value of the loss function Loss based on equation (3). Before the loss calculation unit 605 does so, the regression model fitness evaluation unit 604 performs the following processing. First, it sets, for each attribute, a multiple regression model whose explanatory variables are the latent variables output from the MFCVAE model in response to the input data of step S13 and whose objective variable is the attribute value. Next, from the latent variables and the attribute values, it calculates for each attribute the predicted value of the attribute value and the regression coefficients of the multiple regression model that minimize the prediction error with respect to the attribute value. Then, based on the predicted values and regression coefficients calculated for each attribute, it calculates for each attribute an index that takes a smaller value the better the latent variables and attribute values fit the multiple regression model. After that, the loss calculation unit 605 calculates the function value of the loss function Loss in step S14.

なお、入力データが基準データ２０１及び評価データ２０２を含む場合、ステップＳ１４では、回帰モデル適合度評価部６０４は、損失関数Ｌｏｓｓの追加項を、基準データ２０１を用いて計算する。一方、損失算出部６０５は、再構成誤差項及び正則化項を、基準データ２０１及び評価データ２０２の何れか一方又は両方を用いて計算する。これは、損失関数Ｌｏｓｓの追加項は、潜在変数と属性値との重回帰モデルへの適合度に基づくことから、属性値を含む基準データのみ損失関数Ｌｏｓｓの追加項を計算可能なためである。 Note that when the input data includes the reference data 201 and the evaluation data 202, in step S14 the regression model fitness evaluation unit 604 calculates the additional term of the loss function Loss using the reference data 201. The loss calculation unit 605, on the other hand, calculates the reconstruction error term and the regularization term using either or both of the reference data 201 and the evaluation data 202. This is because the additional term of the loss function Loss is based on the goodness of fit of the latent variables and attribute values to the multiple regression model, so it can be calculated only for reference data, which includes attribute values.

次にステップＳ１５では、モデル更新部６０６は、誤差逆伝搬によりＭＦＣＶＡＥモデルのパラメータを更新する。次にステップＳ１６では、モデル更新部６０６は、所定条件（エポック数が所定回数を超えたか、又はＭＦＣＶＡＥモデルによる推定値と実際の値の誤差が所定値を下回った）が充足されたかを判定する。モデル更新部６０６は、所定条件が充足された場合（ステップＳ１６ＹＥＳ）にステップＳ１７に処理を移し、所定条件が充足されていない場合（ステップＳ１６ＮＯ）にステップＳ１３に処理を戻す。 Next, in step S15, the model update unit 606 updates the parameters of the MFCVAE model by error backpropagation. Next, in step S16, the model update unit 606 determines whether a predetermined condition is satisfied (the number of epochs has exceeded a predetermined number, or the error between the value estimated by the MFCVAE model and the actual value has fallen below a predetermined value). The model update unit 606 moves the process to step S17 when the predetermined condition is satisfied (step S16: YES), and returns the process to step S13 when it is not satisfied (step S16: NO).

ステップＳ１７では、エンコーダ部６０７は、学習済みのＭＦＣＶＡＥモデルのエンコーダ２０３に評価データ２０２を入力し、潜在変数２０４と偏回帰係数２０９を出力する。 In step S17, the encoder unit 607 inputs the evaluation data 202 to the encoder 203 of the learned MFCVAE model, and outputs the latent variable 204 and the partial regression coefficient 209.
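The control flow of steps S13 to S16 can be sketched as a loop with the two stopping conditions. This is a skeleton only: `step` is a hypothetical callable standing in for one pass of input, loss calculation, and backpropagation, and it is assumed to return the current error between estimated and actual values.

```python
def train_until_done(step, max_epochs, tol):
    """Repeat S13 to S15 until the S16 condition holds.

    step:       hypothetical callable that runs one epoch and returns the error.
    max_epochs: upper limit on the number of epochs.
    tol:        training stops once the error falls below this value.
    Returns (epochs_run, last_error).
    """
    if max_epochs < 1:
        raise ValueError("max_epochs must be at least 1")
    for epoch in range(1, max_epochs + 1):
        error = step()      # S13 to S15: input data, compute loss, update parameters
        if error < tol:     # S16 YES: error fell below the threshold
            break
    return epoch, error     # also reached when the epoch limit is hit
```

After the loop ends, step S17 would run the trained encoder on the evaluation data and output the latent variables together with the regression coefficients from the last epoch.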

（実施形態１に係る属性値付与処理）
図１０は、実施形態１に係る属性値付与処理を示すフローチャートである。属性値付与処理は、属性値付与部６０８（図８）によって、ユーザ指示を契機として実行される。 (Attribute value assignment processing according to Embodiment 1)
FIG. 10 is a flowchart showing attribute value assignment processing according to the first embodiment. The attribute value assignment process is executed by the attribute value assignment unit 608 (FIG. 8) in response to a user instruction.

先ずステップＳ２１では、属性値推定部６０９は、特徴量抽出部６０３（エンコーダ２０３）から得られた潜在変数２０４と偏回帰係数２０９から、評価データ２０２の属性値の予測値２１０を算出する。次にステップＳ２２では、属性及び属性値出力部６１０は、基準データ２０１と、ステップＳ２１で属性値の予測値２１０が算出された評価データ２０２との属性及び属性値から、属性毎の各属性値の出現頻度のヒストグラムを求める（後述の図１４参照）。そして属性及び属性値出力部６１０は、このヒストグラムをもとに各属性値が基準データ２０１及び評価データ２０２の各属性において出現する確率をデータ含有率として求め、結果を出力する。 First, in step S21, the attribute value estimating unit 609 calculates the predicted value 210 of the attribute value of the evaluation data 202 from the latent variables 204 and the partial regression coefficients 209 obtained from the feature extraction unit 603 (encoder 203). Next, in step S22, the attribute and attribute value output unit 610 builds, for each attribute, a histogram of the frequency of occurrence of each attribute value from the attributes and attribute values of the reference data 201 and of the evaluation data 202 for which the predicted value 210 was calculated in step S21 (see FIG. 14, described later). Based on this histogram, the attribute and attribute value output unit 610 then obtains, as the data content rate, the probability that each attribute value appears in each attribute of the reference data 201 and the evaluation data 202, and outputs the result.
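The histogram and data content rate of step S22 can be sketched for one attribute as follows. The fixed bin width and the function name are assumptions of this sketch; the source does not specify how attribute values are binned.

```python
import math
from collections import Counter


def content_rates(values, bin_width):
    """Map each bin's lower edge to the fraction of data falling in that bin.

    values:    attribute values of one attribute (given and predicted).
    bin_width: width of each histogram bin (assumption of this sketch).
    """
    if not values:
        raise ValueError("at least one attribute value is required")
    counts = Counter(math.floor(v / bin_width) for v in values)  # histogram
    total = len(values)
    return {b * bin_width: n / total for b, n in sorted(counts.items())}
```

The returned fractions sum to 1 and correspond to the vertical axis of a histogram normalized by the number of data items.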

（実施形態１に係るデータ生成処理）
図１１は、実施形態１に係るデータ生成処理を示すフローチャートである。データ生成処理は、データ生成部６１４（図８）によって、ユーザ指示を契機として実行される。 (Data generation processing according to Embodiment 1)
FIG. 11 is a flowchart showing data generation processing according to the first embodiment. The data generation process is executed by the data generation unit 614 (FIG. 8) in response to a user instruction.

先ずステップＳ３１では、データ生成部６１４は、ユーザによる生成させたい属性及び属性値４０１の入力を受け付ける。次にステップＳ３２では、潜在変数算出部６１５は、ステップＳ３１で入力を受け付けた属性及び属性値４０１と偏回帰係数２０９から、潜在変数２０４を計算して出力する。次にステップＳ３３では、デコード部６１６（デコーダ２０５）は、ステップＳ３２で計算された潜在変数２０４を入力として生成させたい属性及び属性値４０１を持つデータ４０５（例えば文字データ）を再構成する。次にステップＳ３４では、データ出力部６１７は、デコード部６１６（デコーダ２０５）によって再構成されたデータ４０５を出力する。 First, in step S31, the data generation unit 614 receives input from the user of attributes and attribute values 401 to be generated. Next, in step S32, the latent variable calculation unit 615 calculates and outputs the latent variable 204 from the attributes and attribute values 401 and the partial regression coefficients 209 received as input in step S31. Next, in step S33, the decoding unit 616 (decoder 205) uses the latent variable 204 calculated in step S32 as input to reconstruct data 405 (for example, character data) having the attribute and attribute value 401 to be generated. Next, in step S34, the data output section 617 outputs the data 405 reconstructed by the decoding section 616 (decoder 205).

なお、潜在変数算出部６１５は、ステップＳ３１で入力を受け付けた属性及び属性値４０１に該当する基準データ２０１が存在する場合には、ステップＳ３２をスキップし、ステップＳ３３でこの基準データ２０１に対応するデータを再構成したデータ４０５とする。 Note that, if reference data 201 matching the attribute and attribute value 401 received in step S31 exists, the latent variable calculation unit 615 skips step S32, and in step S33 the data corresponding to this reference data 201 is used as the reconstructed data 405.

（実施形態１に係るデータ品質評価処理）
図１２は、実施形態１に係るデータ品質評価処理を示すフローチャートである。データ品質評価処理は、データ品質評価部６１２（図８）によって、ユーザ指示を契機として実行される。 (Data quality evaluation process according to Embodiment 1)
FIG. 12 is a flowchart showing data quality evaluation processing according to the first embodiment. The data quality evaluation process is executed by the data quality evaluation unit 612 (FIG. 8) in response to a user instruction.

ステップＳ４１では、データ品質評価部６１２は、属性値付与部６０８によって出力された属性及び属性値６１１に関して、例えば上述の（１）データ設計の十分性、（２）データセットの被覆性、（３）データセットの均一性の少なくとも一つの観点で評価する。次にステップＳ４２では、データ品質評価部６１２は、ステップＳ４１のデータ品質評価結果６１３を出力する。 In step S41, the data quality evaluation unit 612 evaluates the attributes and attribute values 611 output by the attribute value assigning unit 608 from at least one of the viewpoints described above, for example (1) sufficiency of data design, (2) coverage of the dataset, and (3) uniformity of the dataset. Next, in step S42, the data quality evaluation unit 612 outputs the data quality evaluation result 613 of step S41.

（属性及び属性値の出力例１）
図１３は、属性及び属性値の出力例１（データに対する属性及び属性値）を示す図である。図１３は、属性値付与部６０８の属性値推定部６０９（図８）によって、例えば図２又は図３に示す属性値が付与されていなかったデータに属性値が付与され、属性及び属性値出力部６１０によって出力されたものである。 (Example 1 of output of attributes and attribute values)
FIG. 13 is a diagram showing output example 1 of attributes and attribute values (attributes and attribute values for each data item). FIG. 13 shows the result after the attribute value estimating unit 609 (FIG. 8) of the attribute value assigning unit 608 has assigned attribute values to data that had none, for example the data shown in FIG. 2 or FIG. 3, and the attribute and attribute value output unit 610 has output them.

（属性及び属性値の出力例２）
図１４は、データ、属性、及び属性値の出力例２（各属性及び属性値に対するデータ数）を示す図である。図１４は、図１３の表示方法を変えた出力例である。図１４は、属性値付与部６０８の属性及び属性値出力部６１０（図８）によって出力される属性毎の属性値のヒストグラムである。この表示によって、属性毎に例えば上述の（２）データセットの被覆性や（３）データの均一性を確認できる。（２）データセットの被覆性は、図１４のヒストグラムの各属性の属性値が所定の広い範囲に分布しかつ各度数が何れも所定数以上であることで充足されると考えられる。（３）データの均一性は、図１４のヒストグラムの各属性の属性値が所定の広い範囲に均等に分布していることで充足されると考えられる。このような分析によって、属性値に対して不足しているデータを確認することが可能となる。 (Example 2 of output of attributes and attribute values)
FIG. 14 is a diagram showing output example 2 of data, attributes, and attribute values (the number of data items for each attribute and attribute value). FIG. 14 is an output example in which the display method of FIG. 13 is changed. It shows the histograms of attribute values for each attribute that are output by the attribute and attribute value output unit 610 (FIG. 8) of the attribute value assigning unit 608. This display makes it possible to check, for each attribute, for example (2) coverage of the dataset and (3) uniformity of the data described above. (2) Coverage of the dataset is considered satisfied when the attribute values of each attribute in the histograms of FIG. 14 are distributed over a predetermined wide range and every frequency is at least a predetermined number. (3) Uniformity of the data is considered satisfied when the attribute values of each attribute in the histograms of FIG. 14 are evenly distributed over a predetermined wide range. Such analysis makes it possible to identify the attribute values for which data is lacking.

例えば図１４のヒストグラム１１０１は、属性１の度数分布を示す。ヒストグラム１１０１は、ヒストグラム１１０２、１１０３と比較してデータの分布範囲が広い又は同等であるが、この分布範囲に存在しない属性値がある。この点でヒストグラム１１０１は、（２）データセットの被覆性が充足されていないと言える。またヒストグラム１１０１は、属性値の分布が均一でない。属性値の分布の均一性は、属性値の分散や標準偏差といったバラつきを表す統計値に基づいて判断できる。この点でヒストグラム１１０１は、（３）データの均一性が充足されていないと言える。 For example, a histogram 1101 in FIG. 14 shows the frequency distribution of attribute 1. Although the histogram 1101 has a data distribution range that is wider or equivalent to that of the histograms 1102 and 1103, there are attribute values that do not exist in this distribution range. In this respect, it can be said that the histogram 1101 does not satisfy (2) coverage of the data set. Further, in the histogram 1101, the distribution of attribute values is not uniform. The uniformity of the distribution of attribute values can be determined based on statistical values representing variations such as variance and standard deviation of attribute values. In this respect, it can be said that the histogram 1101 does not satisfy (3) data uniformity.

また図１４のヒストグラム１１０２は、属性２の度数分布を示す。ヒストグラム１１０２は、ヒストグラム１１０１、１１０３と比較してデータの分布範囲が狭く、この分布範囲に存在しない属性値がある。この点でヒストグラム１１０２は、（２）データセットの被覆性が充足されていないと言える。またヒストグラム１１０２は、属性値の分布が均一でない。この点でヒストグラム１１０２は、（３）データの均一性が充足されていないと言える。 Further, a histogram 1102 in FIG. 14 shows the frequency distribution of attribute 2. Histogram 1102 has a narrower data distribution range than histograms 1101 and 1103, and there are attribute values that do not exist in this distribution range. In this respect, it can be said that the histogram 1102 does not satisfy (2) coverage of the data set. Further, in the histogram 1102, the distribution of attribute values is not uniform. In this respect, it can be said that the histogram 1102 does not satisfy (3) data uniformity.

また図１４のヒストグラム１１０３は、属性Ｊの度数分布を示す。ヒストグラム１１０３は、ヒストグラム１１０１、１１０２と比較してデータの分布範囲が広い又は同等であるが、この分布範囲に存在しない属性値がある。この点でヒストグラム１１０３は、（２）データセットの被覆性が充足されていないと言える。またヒストグラム１１０３は、ヒストグラム１１０１、１１０２と比較して属性値の分布が均一でない。この点でヒストグラム１１０３は、（３）データの均一性が充足されていないと言える。 Further, a histogram 1103 in FIG. 14 shows the frequency distribution of attribute J. The histogram 1103 has a data distribution range that is wider than or equivalent to that of the histograms 1101 and 1102, but some attribute values within this range are absent. In this respect, the histogram 1103 does not satisfy (2) coverage of the dataset. In addition, the distribution of attribute values in the histogram 1103 is less uniform than in the histograms 1101 and 1102. In this respect, the histogram 1103 does not satisfy (3) uniformity of the data.

Note that the vertical axis of each graph in FIG. 14 may be, instead of the "number of data items", the "data content rate at which each attribute value appears in each attribute of the reference data 201 and the evaluation data 202".
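The coverage and uniformity judgments described for FIG. 14 can be illustrated with a minimal sketch. It assumes discrete attribute values and a known list of expected values per attribute; the function names, the sample values, and the choice of the standard deviation of the per-bin counts as the spread statistic are all assumptions made for illustration, not details taken from the embodiment.

```python
from collections import Counter
from statistics import pstdev


def bin_counts(values, bins):
    """Number of data items at each expected attribute value (bin)."""
    counts = Counter(values)
    return [counts.get(b, 0) for b in bins]


def content_rates(values, bins):
    """Share of the data set at each attribute value (alternative vertical axis)."""
    total = len(values)
    return [c / total for c in bin_counts(values, bins)]


def coverage_gaps(values, bins):
    """Attribute values inside the observed range for which no data exists."""
    lo, hi = min(values), max(values)
    return [b for b, c in zip(bins, bin_counts(values, bins))
            if lo <= b <= hi and c == 0]


def count_spread(values, bins):
    """Standard deviation of the per-bin counts; 0.0 means perfectly even."""
    return pstdev(bin_counts(values, bins))


attribute_1 = [1, 1, 1, 2, 4, 4, 5]   # made-up attribute values
bins = [1, 2, 3, 4, 5]                # made-up set of expected values
print(coverage_gaps(attribute_1, bins))   # values with no data inside the range
print(count_spread(attribute_1, bins))    # larger means less uniform
```

A non-empty result from `coverage_gaps` corresponds to a coverage shortfall, and a large `count_spread` corresponds to a uniformity shortfall. Any threshold for "large" would have to be chosen by the user.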

[Effects of Embodiment 1]
In this embodiment, the user explicitly specifies the attributes of the training data and test data, and the attributes are expressed as quantitative attribute values. This enables attribute analysis that is highly interpretable for the user. As a result, it becomes easier to find data that is missing from the training data or test data, and to find the characteristics of data that is frequently misclassified.

In addition, in this embodiment, unlike the prior art, the user does not have to interpret after the fact which attributes the obtained latent variables carry (for example, that a latent variable depends on thickness or angle). Instead, the user can explicitly specify the attributes that the latent variables should carry, which enables attribute analysis that follows the user's intention.

In addition, with the prior art, attributes are known only qualitatively, so attributes could not be compared between data learned with different data sets or different models. In this embodiment, attribute values are obtained quantitatively, so attributes can be compared between data learned with different data sets or different models.

Furthermore, in this embodiment, even when the use is limited to predicting attribute values, attribute values can be assigned with a smaller amount of data or training than when attribute values are estimated with a regression model trained by supervised learning.
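How attribute values might be assigned from latent variables can be sketched as follows. The sketch assumes that latent vectors have already been obtained from a trained encoder, fits a multiple regression by ordinary least squares on labeled reference data, and applies the resulting coefficients to an unlabeled item. The function names, the sample latent vectors, and the use of plain least squares are illustrative assumptions; the encoder itself is not modeled here.

```python
def fit_least_squares(latents, targets):
    """Return [intercept, w1, ..., wd] minimizing the squared prediction error."""
    rows = [[1.0] + list(z) for z in latents]
    n = len(rows[0])
    # Normal equations (X^T X) beta = X^T y, stored as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
           + [sum(r[i] * t for r, t in zip(rows, targets))]
           for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    # A singular system (collinear latents) raises ZeroDivisionError.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict(beta, latent):
    """Predicted attribute value for one latent vector."""
    return beta[0] + sum(w * v for w, v in zip(beta[1:], latent))


def mean_squared_error(beta, latents, targets):
    """Prediction error of the fitted model; smaller means a better fit."""
    errors = [(predict(beta, z) - t) ** 2 for z, t in zip(latents, targets)]
    return sum(errors) / len(errors)


# Made-up latent vectors of labeled reference data and their attribute values.
reference_latents = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reference_values = [2.0, 5.0, 1.0, 4.0]

beta = fit_least_squares(reference_latents, reference_values)
# Assign a predicted attribute value to an unlabeled evaluation item.
print(predict(beta, [2.0, 2.0]))
```

In practice one such regression would be fitted per attribute, with the coefficients taken from the final training epoch.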

Furthermore, in this embodiment, data can be generated by specifying attribute values for a plurality of attributes, so the required data can be generated easily. When data is generated, if reference data that matches the specified attributes and attribute values exists, the data corresponding to that reference data is adopted as the reconstructed data. This allows data to be reconstructed more quickly than when latent variables are calculated from the attributes, the attribute values, and the partial regression coefficients and then decoded by the decoder.
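The lookup-before-decode behavior described above can be sketched as follows. The function `generate`, the representation of reference data as (attributes, data) pairs, and the two callables `latent_from_attributes` and `decode` are hypothetical stand-ins introduced only for illustration; the actual computation of latent variables from partial regression coefficients and the decoder are not modeled.

```python
def generate(requested, reference, latent_from_attributes, decode):
    """Return (data, source) for the requested attribute values.

    reference: list of (attributes, data) pairs for labeled reference data.
    latent_from_attributes, decode: stand-ins for the regression-based
    latent computation and the trained decoder (assumed, not modeled).
    """
    # Fast path: reuse reference data whose attribute values match exactly.
    for attributes, data in reference:
        if attributes == requested:
            return data, "reference"
    # Slow path: compute a latent variable and decode it.
    return decode(latent_from_attributes(requested)), "decoded"


# Made-up reference data and trivial stand-in callables.
reference = [({"thickness": 2, "angle": 10}, "image_a")]
to_latent = lambda attrs: (attrs["thickness"], attrs["angle"])
decoder = lambda z: ("decoded", z)

print(generate({"thickness": 2, "angle": 10}, reference, to_latent, decoder))
print(generate({"thickness": 4, "angle": 30}, reference, to_latent, decoder))
```

Exact matching on attribute values is the simplest possible criterion. Whether a tolerance should be allowed for continuous attribute values is a design decision left open here.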

[Embodiment 2]
Embodiment 1 showed an example in which a single information processing system 1 executes model learning (FIGS. 4 and 9), attribute value assignment and output of the relationship between attributes and attribute values (FIGS. 5 and 10), and data quality evaluation processing (FIG. 12). However, model learning, attribute value assignment, and data quality evaluation processing may be executed in parallel by a plurality of information processing systems 1 (1-1, 1-2, ..., 1-n) shown in FIG. 15.

For example, when model learning is executed by a plurality of information processing systems 1, steps S13 to S16 (FIG. 9) may be executed on each information processing system 1 using input data that includes different reference data 201 for each system. At least one of the information processing systems 1 then merges and outputs the learning results of the MFCVAE model obtained by the respective information processing systems 1.

In addition, based on the learning result of the MFCVAE model obtained on each information processing system 1, each information processing system 1 may assign attribute values to its input data and output the relationship between attributes and attribute values (steps S21 to S23 (FIG. 10)). At least one of the information processing systems 1 then merges and outputs the relationships between attributes and attribute values (FIG. 14) obtained by the respective information processing systems 1.
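One possible way to merge the attribute and attribute-value relationships produced by several systems is to add up their per-attribute frequency counts, as in the minimal sketch below. The text does not specify the merge procedure, so the function `merge_relations`, the dictionary layout, and the sample counts are assumptions made for illustration.

```python
from collections import Counter


def merge_relations(per_system):
    """Sum per-attribute frequency counts reported by each system.

    per_system: list of {attribute name: {attribute value: count}} dicts,
    one dict per information processing system (assumed layout).
    """
    merged = {}
    for relation in per_system:
        for attribute, counts in relation.items():
            # Counter.update adds counts instead of overwriting them.
            merged.setdefault(attribute, Counter()).update(counts)
    return merged


# Made-up histograms from two systems.
system_1 = {"attribute 1": {1: 3, 2: 1}}
system_2 = {"attribute 1": {2: 2, 4: 5}, "attribute 2": {0: 1}}

merged = merge_relations([system_1, system_2])
print(dict(merged["attribute 1"]))
```

Adding counts is meaningful only because the attribute values are quantitative and share the same scale across systems.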

In this embodiment, unlike the prior art, latent variables are obtained quantitatively, so calculation results can be merged even when each model is processed in parallel on a separate system. The load of model learning, attribute value assignment, and output of the relationship between attributes and attribute values can therefore be distributed across multiple systems. As a result, these processes can be completed, and the required data generated, in a shorter time than before.

(Application examples of the embodiments)
As described above, the embodiments can be applied to character recognition of handwritten characters. More generally, the embodiments can be applied to any case in which it is difficult to label all of the data, accurate attribute values (labels) are assigned to only part of the data, and attribute values (labels) are to be assigned to the remaining data.

One example is assigning rotation speed labels to vibration data of factory equipment. Here it is assumed that the vibration data acquired in the past carries no rotation speed labels, that a device capable of measuring rotation speed is newly introduced, and that rotation speed labels are to be assigned to the previously acquired factory equipment data.

Another example is predicting the angle of a subject in an image, that is, predicting the angle of the subject in an unknown image from a small amount of data labeled with angles. This application can be used, for example, to control the gripping direction when a robot grasps an object.

Another example is evaluating the impression of music. From impressions of pieces of music that the user has evaluated in advance (fun, sad, happy, lonely, and so on), impression labels can be assigned to unknown pieces of music.

Another example is visualizing the research fields of academic papers. Based on papers whose degree of relevance to each field is known in advance (for example, relevance to the image recognition field is 30, relevance to the reinforcement learning field is 50, and so on), the degree of relevance of unknown papers to each field is estimated.

(Hardware of computer 1000)
FIG. 16 is a hardware diagram showing the configuration of the computer 1000. For example, the information processing system 1, or each system obtained by appropriately distributing the information processing system 1 into units such as the feature extraction unit 603, the attribute value assignment unit 608, the data generation unit 614, and the data quality evaluation unit 612, is realized by the computer 1000.

The computer 1000 includes a processor 1001 such as a CPU, a main storage device 1002, an auxiliary storage device 1003, a network interface 1004, an input device 1005, and an output device 1006, which are connected to one another via an internal communication line 1009 such as a bus.

The processor 1001 controls the operation of the computer 1000 as a whole. The main storage device 1002 is composed of, for example, volatile semiconductor memory and is used as the work memory of the processor 1001. The auxiliary storage device 1003 is an example of a non-transitory storage medium. It is composed of a large-capacity nonvolatile storage device such as a hard disk device, an SSD (Solid State Drive), or flash memory, and is used to hold various programs and data for a long period.

The executable program 1100 stored in the auxiliary storage device 1003 is loaded into the main storage device 1002 when the computer 1000 starts up or when needed, and the processor 1001 executes the executable program 1100 loaded into the main storage device 1002. A system that executes the various processes is thereby realized.

Note that the executable program 1100 may be recorded on a non-transitory recording medium, read from that medium by a medium reading device, and loaded into the main storage device 1002. Alternatively, the executable program 1100 may be obtained from an external computer via a network and loaded into the main storage device 1002.

The network interface 1004 is an interface device for connecting the computer 1000 to each network in the system or for communicating with other computers. The network interface 1004 is composed of, for example, a NIC (Network Interface Card) for a wired LAN (Local Area Network), a wireless LAN, or the like.

The input device 1005 is composed of a keyboard, a pointing device such as a mouse, and the like, and is used by the user to input various instructions and information to the computer 1000. The output device 1006 is composed of, for example, a display device such as a liquid crystal display or an organic EL (Electro Luminescence) display and an audio output device such as a speaker, and is used to present necessary information to the user when needed.

Note that the present invention is not limited to the embodiments described above and includes various modifications and equivalent configurations within the spirit of the appended claims. For example, the embodiments above are described in detail to explain the present invention in an easy-to-understand manner, and the present invention is not necessarily limited to configurations that include all of the elements described. Part of the configuration of one embodiment may be replaced with the configuration of another embodiment. The configuration of another embodiment may be added to the configuration of one embodiment. For part of the configuration of each embodiment, other configurations may be added, deleted, or substituted.

In addition, some or all of the configurations, functions, processing units, processing means, and the like described above may be realized in hardware, for example by designing them as integrated circuits. Alternatively, they may be realized in software by having a processor interpret and execute programs that implement the respective functions.

Information such as the programs, tables, and files that implement each function can be stored in a storage device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a non-transitory recording medium such as an IC (Integrated Circuit) card, an SD card, or a DVD (Digital Versatile Disc).

In addition, the control lines and information lines shown are those considered necessary for the explanation, and not all of the control lines and information lines required in an implementation are necessarily shown. In practice, almost all of the components may be considered to be connected to one another.

1: information processing system, 201: reference data, 202: evaluation data, 204: latent variable, 603: feature extraction unit, 608: attribute value assignment unit, 612: data quality evaluation unit, 614: data generation unit, 1000: computer.

Claims

An information processing method executed by an information processing system having a processing unit and a storage unit, wherein
the processing unit executes:
a first step of inputting input data, which includes reference data in which attribute values are assigned to a plurality of attributes of the data, to an MFCVAE (Multi-Facet Clustering Variational Auto-Encoder) that outputs a latent variable for each of the plurality of attributes of the data;
a second step of setting, for each attribute, a regression model in which the latent variable output from the MFCVAE in response to the input data is an explanatory variable and the attribute value is an objective variable;
a third step of calculating, for each attribute, from the latent variable and the attribute value, a predicted value of the attribute value that minimizes a prediction error with respect to the attribute value, and a regression coefficient of the regression model;
a fourth step of calculating, for each attribute, based on the predicted value and the regression coefficient calculated for each attribute in the third step, an index that takes a smaller value the better the latent variable and the attribute value fit the regression model;
a fifth step of calculating a function value of a loss function obtained by adding an additional term based on the index for each attribute to a loss function of the MFCVAE that has a reconstruction error term representing an error of data reconstruction by the MFCVAE and a regularization term that constrains the distribution of the latent variable; and
a sixth step of updating model parameters of the MFCVAE by error backpropagation based on the function value calculated in the fifth step, and
model learning of the MFCVAE is executed by repeating the first step through the sixth step in this order until the prediction error or the number of epochs satisfies a predetermined condition.

The information processing method according to claim 1, wherein
the regression model is a multiple regression model.

The information processing method according to claim 1, wherein
the index for each attribute is a coefficient of determination of the regression model.

The information processing method according to claim 1, wherein
the index for each attribute is the prediction error.

The information processing method according to claim 4, wherein
the prediction error is a mean squared error.

The information processing method according to claim 1, wherein
the additional term is a term obtained by multiplying the index for each attribute by a weighting coefficient for each attribute, and
the processing unit
determines the weighting coefficient for each attribute such that, for each attribute, the absolute values of the index, the reconstruction error term, and the regularization term are of the same order of magnitude.

The information processing method according to claim 1, wherein
the input data includes the reference data and evaluation data in which the attribute values are not assigned to the attributes, and
the processing unit, in the fifth step,
calculates the additional term using the reference data, and
calculates the reconstruction error term and the regularization term using one or both of the reference data and the evaluation data.

The information processing method according to claim 1, wherein
the input data includes the reference data and evaluation data in which the attribute values are not assigned to the attributes, and
the processing unit executes:
a seventh step of inputting the evaluation data to the MFCVAE whose model has been trained by repeating the first step through the sixth step, and acquiring the latent variable for the evaluation data; and
an eighth step of calculating, based on the latent variable acquired in the seventh step and the regression coefficient at the final epoch of the model learning of the MFCVAE, a predicted value of the attribute value for each attribute of the evaluation data to which no attribute value is assigned, and assigning the predicted value to the evaluation data as the attribute value.

The information processing method according to claim 8, wherein
the processing unit executes
a ninth step of outputting information on the attributes and the attribute values of the reference data and of the evaluation data to which the predicted value has been assigned in the eighth step.

The information processing method according to claim 8, wherein
the processing unit executes
a tenth step of evaluating the input data, using the reference data and the evaluation data to which the predicted value has been assigned in the eighth step, from viewpoints that include sufficiency of data design, coverage of the data, or uniformity of the data.

The information processing method according to claim 1, wherein
the processing units of a plurality of the information processing systems
each execute, using different input data, model learning of the MFCVAE by repeating the first step through the sixth step until the prediction error or the number of epochs satisfies a predetermined condition, and
execute an eleventh step of merging and outputting the learning results of the MFCVAE model obtained by the respective processing units.

The information processing method according to claim 8, wherein
the processing units of a plurality of the information processing systems
each execute the first step through the eighth step using different input data, and
execute a twelfth step of merging and outputting the information, obtained by the respective processing units, on the attributes and the attribute values of the reference data and of the evaluation data to which the predicted value has been assigned in the eighth step.

The information processing method according to claim 1, wherein
the processing unit executes
a thirteenth step of inputting specified attributes and attribute values to the MFCVAE whose model has been trained by repeating the first step through the sixth step, calculating the latent variable from the input attributes and attribute values and the regression coefficient, and reconstructing, based on the latent variable, the data corresponding to the input attributes and attribute values.

The information processing method according to claim 13, wherein
the processing unit
adopts, when reference data corresponding to the specified attributes and attribute values exists, the data corresponding to that reference data as the reconstructed data, and
executes the thirteenth step when no reference data corresponding to the specified attributes and attribute values exists.

The information processing method according to claim 1, wherein
the reference data includes printed characters and handwritten characters, and the evaluation data includes handwritten characters.

An information processing system comprising:
a regression model fitness evaluation unit that
inputs input data, which includes reference data in which attribute values are assigned to a plurality of attributes of the data, to an MFCVAE (Multi-Facet Clustering Variational Auto-Encoder) that outputs a latent variable for each of the plurality of attributes of the data,
sets, for each attribute, a regression model in which the latent variable output from the MFCVAE in response to the input data is an explanatory variable and the attribute value is an objective variable,
calculates, for each attribute, from the latent variable and the attribute value, a predicted value of the attribute value that minimizes a prediction error with respect to the attribute value, and a regression coefficient of the regression model, and
calculates, for each attribute, based on the calculated predicted value and regression coefficient for each attribute, an index that takes a smaller value the better the latent variable and the attribute value fit the regression model;
a loss calculation unit that calculates a function value of a loss function obtained by adding an additional term based on the index for each attribute to a loss function of the MFCVAE that has a reconstruction error term representing an error of data reconstruction by the MFCVAE and a regularization term that constrains the distribution of the latent variable; and
a model update unit that updates model parameters of the MFCVAE by error backpropagation based on the function value calculated by the loss calculation unit, wherein
the regression model fitness evaluation unit, the loss calculation unit, and the model update unit execute model learning of the MFCVAE by repeating their processing in this order until the prediction error or the number of epochs satisfies a predetermined condition.

An information processing program for causing a computer to function as an information processing system, the program
causing the computer to function as:
a regression model fitness evaluation unit that
inputs input data, which includes reference data in which attribute values are assigned to a plurality of attributes of the data, to an MFCVAE (Multi-Facet Clustering Variational Auto-Encoder) that outputs a latent variable for each of the plurality of attributes of the data,
sets, for each attribute, a regression model in which the latent variable output from the MFCVAE in response to the input data is an explanatory variable and the attribute value is an objective variable,
calculates, for each attribute, from the latent variable and the attribute value, a predicted value of the attribute value that minimizes a prediction error with respect to the attribute value, and a regression coefficient of the regression model, and
calculates, for each attribute, based on the calculated predicted value and regression coefficient for each attribute, an index that takes a smaller value the better the latent variable and the attribute value fit the regression model;
a loss calculation unit that calculates a function value of a loss function obtained by adding an additional term based on the index for each attribute to a loss function of the MFCVAE that has a reconstruction error term representing an error of data reconstruction by the MFCVAE and a regularization term that constrains the distribution of the latent variable; and
a model update unit that updates model parameters of the MFCVAE by error backpropagation based on the function value calculated by the loss calculation unit, wherein
the regression model fitness evaluation unit, the loss calculation unit, and the model update unit execute model learning of the MFCVAE by repeating their processing in this order until the prediction error or the number of epochs satisfies a predetermined condition.