JP2018160200A

JP2018160200A - Method for learning neural network, neural network learning program, and neural network learning apparatus

Info

Publication number: JP2018160200A
Application number: JP2017058352A
Authority: JP
Inventors: 匠檀上; Takumi Danjo; 雅文山崎; Masafumi Yamazaki
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2017-03-24
Filing date: 2017-03-24
Publication date: 2018-10-11

Abstract

PROBLEM TO BE SOLVED: To complete the learning process efficiently. SOLUTION: Provided is an NN learning method for optimizing the NN parameters of a neural network (hereinafter NN) using teacher data, comprising: a learning step which, when the teacher data is input to an NN in which a first NN parameter is set, updates the NN parameter to a second NN parameter obtained by subtracting from the first NN parameter the gradient of the error function between the NN output and the correct value, multiplied by a learning rate; an evaluation step of inputting evaluation data to the NN in which the second NN parameter is set and obtaining the accuracy of the NN output (accuracy rate, loss); a step of storing the second NN parameter (w) when the accuracy of the NN output is the best value; and a step of reverting the NN parameter to the stored NN parameter and lowering the learning rate when a first state is entered in which the accuracy of the NN output does not improve. When the first state is entered, the learning step is resumed at the lowered learning rate on the NN in which the reverted NN parameter is set. SELECTED DRAWING: Figure 7

Description

The present invention relates to a neural network learning method, a neural network learning program, and a neural network learning apparatus.

Neural networks (hereinafter NN) and deep neural networks (hereinafter DNN), which are machine learning models, are trained with teacher data. In training an NN or a DNN (hereinafter collectively referred to as NN for simplicity), teacher data is input to the NN and the NN computation is executed to obtain an output. The NN parameters are then updated so that the error between the output and the correct value of the teacher data becomes smaller. Once the error has converged below an allowable value, learning ends and the updated parameters are set in the NN. When given an input to be processed (an image, audio, text, and so on), the NN optimized by learning executes its computation and calculates or estimates an output.

NN learning methods are described in the following patent document.

Japanese Unexamined Patent Application Publication No. 2001-56802 (JP 2001-56802 A)

For an NN, and in particular for the DNNs that have attracted attention in recent years, the learning process takes a very long time. For example, it has been reported that training with the DNN and teacher data set used in an image recognition contest takes a week or more, even when the computation uses an accelerator for a general-purpose processor, such as a graphics processor.

In general, one proposed way to shorten the learning process is to start with a large learning rate and reduce it gradually as the error function value converges. However, simply lowering the learning rate gradually often leads to cases where the accuracy rate or the error suddenly deteriorates, or where continuing the learning process brings no improvement and learning cannot converge.

Accordingly, an object of the first embodiment is to provide a neural network learning method, a neural network learning program, and a neural network learning apparatus that complete the learning process efficiently.

The first embodiment is an NN learning method for optimizing the NN parameters of a neural network (hereinafter NN) using teacher data, the method including: a learning step of updating the NN parameter to a second NN parameter, the second NN parameter being obtained by subtracting from a first NN parameter the gradient of the error function between the NN output and the correct value, multiplied by a learning rate, where the output is the one obtained when the teacher data is input to the NN in which the first NN parameter is set; an evaluation step of inputting evaluation data to the NN in which the second NN parameter is set and obtaining the accuracy of the NN output; a step of storing the second NN parameter when the accuracy of the NN output is the best value; and a step of returning the NN parameter to the stored NN parameter and lowering the learning rate when a first state is reached in which the accuracy of the NN output does not improve. When the first state is reached, the learning step is resumed at the lowered learning rate on the NN in which the restored NN parameter is set.

According to the first embodiment, the neural network learning process can be completed efficiently.

FIG. 1 is a diagram showing the configuration of a DNN apparatus, which is the DNN learning apparatus of the present embodiment.
FIG. 2 is a diagram showing an example DNN configuration.
FIG. 3 is a diagram showing an example of a three-layer network contained in the DNN.
FIG. 4 is a flowchart outlining a DNN learning method.
FIG. 5 is a diagram explaining the gradient descent method.
FIG. 6 is a diagram explaining a problem with the gradient descent method.
FIG. 7 is a flowchart of the first DNN learning process of the present embodiment.
FIG. 8 is a flowchart of the second DNN learning process of the present embodiment.
FIG. 9 is a flowchart of the third DNN learning process of the present embodiment.
FIG. 10 is a diagram showing the relationship between the accuracy rate, the error (loss function, LOSS), and the number of learning iterations.

The learning method of the present embodiment is applicable to both neural networks (NN) and deep neural networks (DNN). The following description uses a DNN as the example, but the method applies equally to an NN.

FIG. 1 is a diagram showing the configuration of a DNN apparatus, which is the DNN learning apparatus of the present embodiment. The DNN apparatus 1 is an information processing apparatus such as a computer or a server. The DNN apparatus 1 has a processor 10, a main memory 12, a network interface 14, and a large-capacity auxiliary storage device 16.

The auxiliary storage device 16 stores a DNN program 20, DNN parameters 22 to be set in the DNN, a DNN learning program 24, and teacher data and evaluation data 26 used for DNN learning. The teacher data and the evaluation data each consist of an input to the DNN and a correct value, which is the correct DNN output for that input. In other words, the teacher data and the evaluation data are the same, except that they are used as teacher data in the learning step and as evaluation data in the evaluation step. The DNN program 20, the DNN parameters 22, the DNN learning program 24, and the teacher data and evaluation data 26 are loaded into the main memory 12, and the processor executes each program.

The network interface 14 is connected to a network NW, and the DNN apparatus 1 is communicably connected to external terminal devices 30 and 32 via the network NW.

The DNN apparatus 1 adopts a deep neural network (DNN) as its deep learning model. The DNN apparatus 1 is provided with teacher data and evaluation data 26, each consisting of input data for the DNN and the corresponding correct data. The processor 10 executes the DNN learning program 24, trains the DNN using the teacher data, and determines the optimal DNN parameters (for example, weights). The processor also executes the DNN learning program to evaluate the accuracy of the DNN output using the evaluation data. Further, the processor sets the optimal parameters extracted by the learning process in the DNN program 20 and executes the DNN program 20 to perform the intended estimation processing of the DNN model on images or other data to be processed.

The DNN program 20 executes the various computations of the model DNN. The DNN learning program 24 executes the DNN computations involved in learning and evaluating the model DNN, together with the processing that extracts the optimal parameters. The DNN learning program performs the DNN computations by calling the DNN program 20. Because a DNN optimizes its parameters by learning from teacher data, the DNN learning program 24 is always attached to or built into the DNN program 20.

FIG. 2 is a diagram showing an example DNN configuration. The DNN has, for example, an input layer INPUT_L, a plurality of DNN units DNN_U1 to DNN_Un, a fully connected layer FULCON_L, and an output layer OUTPUT_L. Each of the DNN units DNN_U1 to DNN_Un has a convolution layer CONV_L that convolves data such as the image data of the input layer with weights W constituting a filter, an activation function layer ACTF_L that evaluates the result of the convolution layer with an activation function (for example, a sigmoid function), and a pooling layer POOL_L that extracts, for example, the maximum value of local computation results. The number of DNN units is tuned as appropriate.

FIG. 3 is a diagram showing an example of a three-layer network contained in the DNN. The example of FIG. 3 has an input layer INPUT_L that receives inputs X_1 to X_n, an intermediate layer (or hidden layer) IM_L, and an output layer OUTPUT_L with a plurality of output nodes that output Z_1 to Z_n. In this network, the inputs X_1 to X_n of the input layer are multiplied by their respective weights w11 to w16, and the accumulated values propagate to the intermediate layer IM_L. These weights are parameters of the network. The activation function f1 described above is placed in the intermediate layer IM_L, and the values Y_1 to Y_n of the nodes of the intermediate layer are as follows.
$$Y_k = f_1\left(\sum (w \cdot X_k) - \theta_1\right)$$
Here, θ1 is the threshold of the sigmoid function f1, and k is the node number of the input layer and of the output layer, with k = 1 to n.

Similarly, the values Y_1 to Y_n of the intermediate layer are multiplied by their respective weights w21 to w26, and the accumulated values propagate to the output layer OUTPUT_L. A second activation function f2 is placed in the output layer, and the values Z_1 to Z_n of the output layer are as follows.
$$Z_k = f_2\left(\sum (w \cdot Y_k) - \theta_2\right)$$
Accordingly, the output Z_k of the output layer is a composite of the two functions f1 and f2, and is a multivariable function whose variables are the plurality of weights that serve as parameters.
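The two layer equations above can be illustrated with a minimal Python sketch. It assumes that the sum in each equation runs over the nodes of the preceding layer, with one weight per connection, and that both f1 and f2 are sigmoid functions. The names `layer` and `forward` and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(u):
    """Logistic sigmoid, assumed here for both f1 and f2."""
    return 1.0 / (1.0 + math.exp(-u))

def layer(inputs, weights, theta, f):
    """Apply f(sum_j w[k][j] * inputs[j] - theta) for every node k."""
    return [f(sum(w * x for w, x in zip(row, inputs)) - theta)
            for row in weights]

def forward(x, w1, theta1, w2, theta2):
    """Propagate inputs X through the intermediate layer (Y) to the output layer (Z)."""
    y = layer(x, w1, theta1, sigmoid)   # Y_k = f1(sum(w * X) - theta1)
    z = layer(y, w2, theta2, sigmoid)   # Z_k = f2(sum(w * Y) - theta2)
    return y, z

# Hypothetical two-node example; weights and thresholds are arbitrary.
x = [1.0, 0.0]
w1 = [[0.5, -0.5], [0.25, 0.75]]
w2 = [[1.0, -1.0], [0.5, 0.5]]
y, z = forward(x, w1, 0.0, w2, 0.0)
```

Because `z` depends on every entry of `w1` and `w2`, the sketch also shows why the output is a multivariable function of the weights.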

FIG. 4 is a flowchart outlining a DNN learning method. The learning method of FIG. 4 is called the mini-batch method, which is one form of gradient descent. In this method, the processor sets the DNN parameters to randomly selected initial values (S40). It then randomly selects a small number (for example, 10) of teacher data items from a large set of teacher data (S41), feeds the inputs of the selected items to the DNN in which the initial parameter values are set, and executes the DNN computation to obtain outputs (S42). The processor then calculates, over all of the selected teacher data items, the total E of the sums of squared differences between the DNN output and the correct value (S43). Here, the sum of squared differences accumulates, for one teacher data item, the squared difference between the output of each output node and its correct value; the total accumulates these sums over the 10 teacher data items.

The processor determines whether this total has converged below a reference value (S44). If it has not (NO in S44), the processor obtains new DNN parameters (weights) based on the gradient ΔE of the total and sets them in the DNN (S45). To update the multiple weights in the DNN, the processor uses error backpropagation to propagate the error, which is the difference between the output value of the output layer and the correct value, toward the input layer of the DNN, and updates the weights between the layers based on the gradient.

The processor then repeats steps S41 to S44, using a different small set of teacher data each time, until the determination in step S44 becomes YES. When the determination in step S44 becomes YES, the parameters at that point are output as the optimized parameters of the DNN.
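The flow of steps S40 to S45 can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: a one-input linear model z = w*x + b stands in for the DNN so that the gradient can be written without backpropagation, and the function name `train_minibatch`, the `max_iters` safety cap, the seed, and all numeric values are hypothetical.

```python
import random

def train_minibatch(data, eta=0.05, batch_size=10, threshold=1e-4,
                    max_iters=10000, seed=0):
    rng = random.Random(seed)
    # S40: set the parameters to randomly selected initial values.
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    E = float("inf")
    for it in range(max_iters):          # cap is an assumption, not in the flowchart
        # S41: randomly select a small number of teacher data items.
        batch = rng.sample(data, batch_size)
        # S42: compute the model output for each selected input.
        outputs = [w * x + b for x, _ in batch]
        # S43: total of the squared errors over the mini-batch.
        errors = [z - t for z, (_, t) in zip(outputs, batch)]
        E = 0.5 * sum(e * e for e in errors)
        # S44: stop once the total is below the reference value.
        if E < threshold:
            return w, b, E, it
        # S45: update the parameters from the gradient of E.
        grad_w = sum(e * x for e, (x, _) in zip(errors, batch))
        grad_b = sum(errors)
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b, E, max_iters

# Hypothetical teacher data: inputs in [-1, 1] with correct values 2x + 1.
data = [(i / 50.0, 2.0 * (i / 50.0) + 1.0) for i in range(-50, 51)]
w, b, E, iters = train_minibatch(data)
```

With noise-free data the loop should leave `w` close to 2 and `b` close to 1; with a real DNN, step S45 would obtain the gradients by error backpropagation instead.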

The mini-batch method updates the parameters by gradient descent based on the total of the sums of squared errors over a small number of teacher data items. It is therefore said to have the advantage that, even if the teacher data contains abnormal items whose inputs and correct values are far from normal, the adverse effect of those abnormal items can be suppressed.

In the sequential-update learning method, as opposed to the mini-batch method, the parameters are updated by gradient descent based on the sum of squared errors of the output-layer nodes for a single teacher data item. The present embodiment can also be applied when the sequential-update learning method is adopted.

FIG. 5 is a diagram explaining the gradient descent method. The horizontal axis is the weight w, a parameter of the DNN, and the vertical axis is the error function E, the total of the sums of squared errors. For simplicity, FIG. 5 shows the axis of only a single weight w. As described above, however, the DNN has many weights w, so the error function E is a multivariable function.

For the network example of FIG. 3, the error function E is, for example, as follows.
$$E = \frac{1}{2} \sum_k (Z_k - t_k)^2$$
Here, k = 1 to n, Z_k is the output value of each of the output-layer nodes as shown in FIG. 3, and t_k is the correct value of the teacher data. Because the output value Z_k is a multivariable function of the weights, the error function E is likewise a multivariable function of the weights.

In the mini-batch method, the error E is the total of the errors over the small set of teacher data items, and is therefore as follows.
$$E = \frac{1}{2} \sum_l \left\{ \sum_k (Z_k - t_k)^2 \right\}$$
Here, l = 1 to L, and L is the number of teacher data items in the set.

Gradient descent is described with reference to FIG. 5. The processor feeds the input of the teacher data to the DNN in which the weight w, a parameter, is set to the initial value w_1, and obtains the total E of the sums of squared errors, where the error is the difference between the resulting output Z and the correct value t of the teacher data. This corresponds to steps S41 to S43 of FIG. 4. Then, as in step S45, the processor updates the weight w by the following equation, based on the gradient ΔE of the total E and the learning rate η.
$$w^{new} = w^{old} - \eta \cdot \Delta E$$
Here, w^old is the weight before the update and w^new is the weight after the update. ΔE is the partial derivative of the error function E with respect to each variable (weight), that is, ΔE = ∂E/∂w. The learning rate η usually satisfies 0 ≤ η ≤ 1 and often takes a small value, for example from 0.0001 to 0.1.

If the gradient is negative, the updated variable moves to the right in FIG. 5; if the gradient is positive, it moves to the left. In the example of FIG. 5, the gradient at the variable w_1 is negative, so the updated variable w_2 has moved to the right. Variables are set between each pair of layers of the DNN. Therefore, the error, which is the difference between the output of the DNN output layer and the correct value, is propagated toward the input layer of the DNN by error backpropagation, and the variables of each layer are updated by the above equation.
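As a small worked illustration of the update equation and of the direction of movement, the following Python sketch applies one update to two weights. The function name `gradient_step` and the numeric values are hypothetical; in a DNN the gradients would come from error backpropagation rather than being given directly.

```python
def gradient_step(weights, grads, eta):
    """Return w_new = w_old - eta * dE/dw for every weight."""
    return [w - eta * g for w, g in zip(weights, grads)]

w_old = [1.0, -2.0]
grad = [-4.0, 2.0]   # assumed gradient values, one per weight
w_new = gradient_step(w_old, grad, eta=0.1)
# The first weight has a negative gradient and increases (moves right);
# the second has a positive gradient and decreases (moves left).
```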

In the example of FIG. 5, the processor feeds the input of another teacher data item to the DNN in which the variable w_2 is set, obtains the output Z, and obtains the total E of the sums of squared errors with respect to the correct value t. The processor then calculates a new weight by the above equation, based on the gradient ΔE of the total and the learning rate η. In the example of FIG. 5, the gradient ΔE at the variable w_2 is also negative.

In the same way, the processor repeatedly uses teacher data to obtain the total E of the sums of squared errors for the DNN updated with the new weight, and calculates a new weight by the above equation based on its gradient and the learning rate. In the example of FIG. 5, the gradient remains negative at the weights w_3, w_4, and w_5 but becomes positive at the next weight w_6. The processor then reduces the learning rate and thereby finds the weight w_min at which the sum of squared errors E is at its minimum.

[Present embodiment]
[Problems with the gradient descent method]
One problem with the gradient descent method is the difficulty of choosing the learning rate. If a low learning rate is chosen, the accuracy of the DNN (accuracy rate or error) improves only slowly and the learning process takes a long time. If a high learning rate is chosen, early learning progresses quickly and the time needed to reach a certain accuracy is shortened, but learning may fail partway through, leaving the accuracy greatly reduced (deteriorated) with no further improvement.

Training an NN, and a DNN in particular, takes a very long time. With the DNNs and teacher data used in image recognition contests, the learning process can take a week or more even with a hardware accelerator such as a GPU (Graphics Processing Unit). Consequently, when the learning rate is raised in an attempt to shorten the learning process, learning may fail and have to be restarted from the beginning, so that the learning process ends up taking longer.

In addition, the accuracy of the DNN may improve greatly at first during learning and then deteriorate gradually. In such cases too, reselecting the learning rate can sometimes prevent the gradual deterioration.

FIG. 6 is a diagram explaining a problem with the gradient descent method. The error E curve in FIG. 6 is the same as in FIG. 5, but in the example of FIG. 6 the learning rate η is set higher than in FIG. 5 and is held constant. In the figure, t denotes the time of the learning cycle, and W at each t denotes the weight at time t. The error E curve has a point where the error E is at its minimum (weight W_min) and a point that is a local minimum but not the minimum (weight W_local). The target is the weight W_min that minimizes the error E; the weight W_local traps the error E in a local solution. Minimizing the error E means that the accuracy of the DNN output is at its best.

At time t = 1, in the DNN in which the initial weight W_1 is set, the gradient ∂E/∂w is negative with a large absolute value, so the updated weight W_2 moves a long way in the positive direction (to the right), and the error function E(W_2) decreases greatly.

At time t = 2, in the DNN in which the weight W_2 is set, the gradient ∂E/∂w is negative with a medium absolute value, and the updated weight W_3 moves in the positive direction (to the right) by a distance that is smaller than at time t = 1 but still relatively large. The error function E(W_3), however, actually increases.

Next, at time t = 3, in the DNN in which the weight W_3 is set, the gradient ∂E/∂w is negative with a small absolute value, so the updated weight W_4 moves slightly further in the positive direction (to the right). Because the absolute value of the gradient is small, the updated error function E(W_4) is almost the same as E(W_3), neither increasing nor decreasing.

Next, at time t = 4, in the DNN in which the weight W_4 is set, the gradient ∂E/∂w is positive with a small absolute value, so the updated weight W_5 (= W_3) instead moves slightly in the negative direction (to the left), and the updated error function E(W_5) is almost the same as E(W_4), neither increasing nor decreasing.

After that, the weight oscillates left and right, staying near the error E of t = 3 when the time t is odd and near the error E of t = 4 when it is even, without exceeding them, and ultimately remains near the local solution W_local. In general, the processor regards learning as stalled and lowers the learning rate η, so the width of the oscillation gradually narrows and the weight finally converges to the local solution W_local.
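The oscillation and its damping can be reproduced on a toy problem. The sketch below is an assumption-laden illustration, not the curve of FIG. 6: it uses the convex function E(w) = w² (gradient 2w) and hypothetical learning-rate schedules. With a fixed η the weight jumps back and forth between two points with unchanged error, and with a decreasing η the swing narrows until the weight settles.

```python
def descend(w, eta_schedule, grad=lambda w: 2.0 * w):
    """Apply w <- w - eta * dE/dw once per learning rate in the schedule."""
    path = [w]
    for eta in eta_schedule:
        w = w - eta * grad(w)
        path.append(w)
    return path

# Fixed learning rate: the weight oscillates and E(w) = w**2 never improves.
fixed = descend(0.5, [1.0] * 4)                 # 0.5, -0.5, 0.5, -0.5, 0.5

# Decreasing learning rate: the oscillation narrows and then stops.
decayed = descend(0.5, [1.0, 0.75, 0.5, 0.25])  # 0.5, -0.5, 0.25, 0.0, 0.0
```

On a curve with several minima, as in FIG. 6, the same damping would settle the weight at whichever minimum the oscillation happens to surround, which may be the local solution rather than the minimum.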

In the above, the difference dE between the values of the error function E at time t = 2 and time t = 3 is very large, and learning can be said to have failed. Nevertheless, as described above, the example of FIG. 6 ultimately converges to a local solution.

[DNN learning process]
FIG. 7 is a flowchart of the first DNN learning process of the present embodiment. First, the processor initializes the DNN parameters (weights, sigmoid function thresholds, and so on) and the learning process parameters (the learning rate η, the instantaneous value An and the maximum value Amax of the accuracy rate, and so on) (S1).

Next, the processor executes the learning step S100. In the learning step S100, the processor executes learning using a plurality of teacher data items a predetermined number of times (M times) (S11). Learning using a plurality of teacher data items here means, for example, steps S41 to S45 of FIG. 4 (excluding S44). A specific example of step S11 is described later. In each round of learning using a plurality of teacher data items, the processor updates the DNN parameters (weights and so on) by the gradient descent method of FIG. 4.

Then, still within the learning step S100, the processor runs a recognition test on the DNN in which the parameters updated in step S11 are set, and obtains the accuracy rate An of the DNN. In the recognition test, the DNN computation (estimation) is executed on the input of teacher data or evaluation data different from that used in step S11, and the DNN output is judged either to match the correct value of that data (correct) or not to match it (incorrect). The recognition test is therefore equivalent to the learning in step S11, except that the step of updating the DNN parameters based on the error between the DNN output and the correct value is not performed. That is, it is equivalent to computing the DNN output for the last teacher data input in step S11 and judging whether that output matches the correct value.

The accuracy rate An above is one measure of DNN accuracy. Another measure may be the error (loss function, LOSS) between the output obtained by executing the DNN computation on the input of teacher data or evaluation data and the correct value of that data. As in the learning step, this error may be the sum of squared differences between the output and the correct value. When the output layer of the DNN has a plurality of output nodes, it is the sum of the squared differences between the output of each output node and the corresponding correct value.

The accuracy rate is, for example, the ratio obtained by running the recognition test multiple times and dividing the number of tests in which the output matched the correct value by the total number of tests. The error (loss function, LOSS), in contrast, is for example the total or the average of the individual errors ($E = \frac{1}{2} \sum_k (y_k - t_k)^2$) over multiple recognition tests.
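The two accuracy measures can be written down directly. The following Python sketch assumes that each recognition test yields a predicted label and, for the loss, a list of output-node values; the function names `accuracy_rate` and `mean_loss` and the sample values are hypothetical.

```python
def accuracy_rate(predictions, labels):
    """Number of recognition tests whose output matched the correct value,
    divided by the number of tests."""
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)

def mean_loss(outputs, targets):
    """Average over the tests of E = 1/2 * sum_k (y_k - t_k)**2."""
    losses = [0.5 * sum((y - t) ** 2 for y, t in zip(out, tgt))
              for out, tgt in zip(outputs, targets)]
    return sum(losses) / len(losses)

# Four tests, three of which match the correct label.
acc = accuracy_rate([1, 0, 2, 2], [1, 0, 1, 2])

# Two tests with two output nodes each.
loss = mean_loss([[1.0, 0.0], [0.5, 0.5]], [[1.0, 0.0], [0.0, 1.0]])
```

A higher accuracy rate is better while a lower loss is better, so a comparison against the stored best value has to use the direction that matches the chosen measure.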

One execution of the learning step S100 above is called one learning cycle. The processor counts the number of executions of the learning step S100 as the learning cycle count n (S13).

Next, the processor executes a failure determination step S200, which comprises a step of storing the state of learning and a determination of whether learning has failed. In the failure determination step S200, the processor determines whether the current accuracy rate An is better (larger) than the best (maximum) accuracy rate Amax obtained so far (S21) and, if it is, stores An as the new best value Amax and stores the DNN parameters and the learning cycle as the state Smax (S22). Steps S21 and S22 constitute the step of storing the state of learning.

The processor further determines whether the current accuracy rate An indicates a failure relative to the maximum accuracy rate Amax so far (S23) and, if a failure is determined (YES in S23), returns the DNN parameters and the learning cycle to the past state Smax and decreases the learning rate η (S24). Steps S23 and S24 constitute the failure determination of learning. In the determination step S23, learning is judged to have failed when, for example, the current accuracy rate An is far worse than the maximum value Amax, or when such a far worse state has persisted over many consecutive cycles.

In the failure determination step S200, while DNN learning is being repeated, the processor judges that learning has failed when the accuracy (accuracy rate or error) of the DNN with the updated parameters has not improved on the best value for a long period, or has moved far away from the best value. When a failure is determined, the processor returns the DNN to the state Smax at which the accuracy rate reached the maximum value Amax, lowers the learning rate η, and resumes learning. The processor may instead return the DNN to a state a predetermined number of learning cycles earlier than the state Smax.

The processor repeats the learning step S100 and the failure determination step S200 until a predetermined number of cycles has been completed (YES in S31), or until the accuracy rate Amax of the DNN exceeds an expected value or converges to a high value (YES in S31). When step S31 results in YES, the processor saves the DNN parameters (the weights and the thresholds of the sigmoid functions) (S32) and ends DNN learning.

As described above, the processor periodically saves the DNN parameters, the learning rate, and the like; judges that learning has failed when, for example, the accuracy of the DNN has not improved for a long period or has diverged; returns the DNN to a saved state from several generations earlier; lowers the learning rate; and resumes DNN learning.

More specifically, on detecting that learning has failed, the processor returns the DNN to the past state Smax, or to a state a predetermined number of learning cycles before the state Smax, lowers the learning rate η, and resumes learning. This makes it possible to avoid a recurrence of the failure in the subsequent learning steps, and the overall learning process is shorter than when learning is restarted from the beginning.
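The loop of steps S11 to S24 can be sketched as below. This is an illustrative outline under stated assumptions, not the claimed implementation: `train_step`, `evaluate`, and `has_failed` are hypothetical callbacks supplied by the caller, the learning rate is reduced by a hypothetical factor `lr_decay`, and the loop is bounded by a hypothetical total step budget `max_steps` so that repeated rollbacks cannot run forever.

```python
import copy


def train_with_rollback(params, train_step, evaluate, has_failed,
                        lr, lr_decay=0.5, max_steps=100):
    n = 0                                    # learning cycle count
    a_max = float("-inf")                    # best accuracy so far (Amax)
    s_max = (copy.deepcopy(params), n)       # saved state Smax: parameters and cycle
    for _ in range(max_steps):
        params = train_step(params, lr)      # S11: update the parameters
        a_n = evaluate(params)               # S12: recognition test
        n += 1                               # S13: count the cycle
        if a_n > a_max:                      # S21: new best accuracy?
            a_max = a_n
            s_max = (copy.deepcopy(params), n)              # S22: store the state
        elif has_failed(a_n, a_max):         # S23: has learning failed?
            params, n = copy.deepcopy(s_max[0]), s_max[1]   # S24: return to Smax
            lr *= lr_decay                                  # S24: lower the learning rate
    return params, a_max, lr, n
```

On a toy one-parameter problem with loss w², an initial learning rate of 1.2 makes the parameter overshoot and diverge; after one rollback and a halved learning rate the same loop converges.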

FIG. 8 is a flowchart of the second learning process of the DNN in the present embodiment. The second learning process of FIG. 8 differs from FIG. 7 in the failure determination step S200. The initialization step S1, the learning step S100, and steps S31 and S32 are the same as in FIG. 7. In the second learning process, however, the processor counts the number u of significant drops in the accuracy rate, which is used to determine a failure of learning. In the initialization step S1, the processor therefore initializes this count to u = 0.

In the failure determination step S200 of FIG. 8, the processor determines whether the current accuracy rate An has fallen below the maximum accuracy rate Amax so far by more than a reference value K, which serves as the failure threshold (An < Amax - K?) (S231), and, if it has, increments the count of significant drops as u = u + 1 (S232). When such a significant drop in the accuracy rate An has occurred U times in succession, that is, when the count u reaches U (YES in S233), the processor judges that learning has failed.

On determining a failure of learning, the processor returns the DNN to the past state Smax, decreases the learning rate η, and initializes the count of significant drops (u = 0) (S24B). Step S24B of FIG. 8 differs from step S24 of FIG. 7 in that it also initializes the count u. The processor also initializes the count (u = 0) when the current accuracy rate An is equal to or greater than Amax - K (YES in S21, or NO in S231) (S234). In other words, the processor determines a failure of learning only when An has been below Amax - K for U consecutive cycles (S233), and it resets the count u whenever An is equal to or greater than Amax - K (NO in S231) (S234).

With the above determination method, learning can be judged to have failed when the accuracy of the DNN output has not improved for a long period, or when the accuracy has deteriorated significantly.
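The counting logic of steps S231 to S234 and S24B can be sketched as a small state machine. The class and attribute names are hypothetical; `k` stands for the reference value K and `u_limit` for the required number U of consecutive significant drops. The sketch covers only the counter; returning to Smax and lowering the learning rate are left to the caller.

```python
class FailureDetector:
    """Counts consecutive cycles in which An < Amax - K."""

    def __init__(self, k, u_limit):
        self.k = k              # reference value K (failure threshold)
        self.u_limit = u_limit  # required number U of consecutive drops
        self.u = 0              # current count of significant drops

    def update(self, a_n, a_max):
        """Return True when learning is judged to have failed in this cycle."""
        if a_n < a_max - self.k:        # S231: significant drop?
            self.u += 1                 # S232: count it
            if self.u >= self.u_limit:  # S233: U consecutive drops reached
                self.u = 0              # S24B: the counter is reset on rollback
                return True
        else:
            self.u = 0                  # S234: the run of drops was interrupted
        return False
```

Calling `update` once per learning cycle, after Amax has been refreshed in steps S21 and S22, reports a failure only at the end of an uninterrupted run of `u_limit` significant drops.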

The failure determination step S200 shown in FIG. 8 is one example. For instance, the processor may determine a failure of learning when the current accuracy rate An has been less than a predetermined ratio L (0 < L < 1.0) of the maximum value Amax for U consecutive cycles.

In another example, the reference value K serving as the failure threshold may be made smaller, or the ratio L made higher, as the maximum value Amax increases. Because the maximum accuracy rate normally rises as the learning step is repeated, the reference value K is made large, or the ratio L low, in the early period of learning so that only a large deterioration is treated as a failure, and the reference value K is made small, or the ratio L high, in the final period of learning so that a small deterioration is already treated as a failure.
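The two variants above can be sketched as follows. The source does not specify how K should depend on Amax, so the linear schedule in `adaptive_k` and its default end points are purely assumptions for illustration; the function names are likewise hypothetical.

```python
def ratio_drop(a_n, a_max, ratio_l):
    """Ratio variant: a significant drop means An < L * Amax, with 0 < L < 1.0."""
    return a_n < ratio_l * a_max


def adaptive_k(a_max, k_early=0.3, k_late=0.05):
    """Assumed schedule: K shrinks linearly from k_early to k_late as Amax rises from 0 to 1."""
    a = min(max(a_max, 0.0), 1.0)  # clamp Amax to the range of an accuracy rate
    return k_early + (k_late - k_early) * a
```

Any monotonically decreasing schedule for K (or increasing schedule for L) would fit the description equally well; the linear form is chosen only because it is the simplest.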

FIG. 9 is a flowchart of the third learning process of the DNN in the present embodiment. FIG. 9 shows a modification of the learning step S100 of FIG. 7 or FIG. 8. In the third learning process, the initialization step S1, the failure determination step S200, and steps S31 and S32 are the same as in FIG. 7 or FIG. 8.

In the learning step S100 of the third learning process shown in FIG. 9, the processor executes the DNN operation on a plurality (N) of teacher data (S111) and calculates the value of the error function E, which is the sum of squared differences between the DNN output y_k computed for each teacher datum and the correct value t_k of that datum, accumulated over the N teacher data (S112). As noted above, when the output layer of the DNN has a plurality of output nodes, the DNN produces a plurality of output values for each teacher-data input, so the sum of squared differences between y_k and t_k for each teacher datum is taken over the outputs of those nodes and their correct values. In step S112, the processor thus calculates the total E by adding up the per-datum sums of squares over the N teacher data.
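Written out, the error function of step S112 and the parameter update recited in the claims take the following form. The sample index i and the superscript notation are introduced here for illustration and do not appear in the source; a constant factor such as the 1/2 used in the per-test error would only rescale the gradient.

```latex
E = \sum_{i=1}^{N} \sum_{k} \left( y_k^{(i)} - t_k^{(i)} \right)^2 ,
\qquad
W \leftarrow W - \eta \, \frac{\partial E}{\partial W}
```

Here y_k^{(i)} and t_k^{(i)} are the output and the correct value of output node k for the i-th teacher datum, and η is the learning rate.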

The processor then obtains the gradient ΔE of the error function E and updates the DNN parameters (the plurality of weights W) by gradient descent (S113). Steps S111 to S113 constitute learning by the so-called mini-batch method. The processor repeats steps S111 to S113 a predetermined number of times (M times) (S114).

Steps S111 to S114 correspond to step S11 of FIG. 7 and FIG. 8. In this way, the processor repeats the learning of steps S111 to S113 a predetermined number of times (M times).
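Steps S111 to S114 can be sketched for the simplest possible model, a single linear unit y = w · x, which is an assumption made here to keep the gradient short; a real DNN would obtain the gradient by backpropagation. The function names and the cycling over `batches` are hypothetical, and the per-datum error uses the factor 1/2 so that the gradient is (y - t) x.

```python
def minibatch_step(w, batch, lr):
    """One mini-batch update for a linear unit y = w . x (steps S111 to S113)."""
    grad = [0.0] * len(w)
    total_error = 0.0
    for x, t in batch:
        y = sum(wi * xi for wi, xi in zip(w, x))      # S111: forward computation
        total_error += 0.5 * (y - t) ** 2             # S112: accumulate the error E
        for i, xi in enumerate(x):
            grad[i] += (y - t) * xi                   # dE/dw_i summed over the batch
    new_w = [wi - lr * gi for wi, gi in zip(w, grad)]  # S113: gradient descent update
    return new_w, total_error


def train_minibatches(w, batches, lr, m):
    """Repeat the mini-batch update M times (S114), cycling through the batches."""
    for step in range(m):
        w, _ = minibatch_step(w, batches[step % len(batches)], lr)
    return w
```

With the two teacher data `([1.0], 2.0)` and `([2.0], 4.0)` in one batch, repeated updates drive the single weight toward 2.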

Further, in the learning step S100, the processor executes the recognition test a plurality of times on the DNN in which the DNN parameters (weights W) obtained after the predetermined number (M) of repetitions are set (S121). The processor then calculates the accuracy rate An obtained in the plurality of recognition tests (the number of times the DNN output matched the correct value divided by the number of recognition tests) (S122). Steps S121 and S122 correspond to step S12 of FIG. 7 and FIG. 8.

In the learning step S100 of the third learning process, steps S111 to S114 and S121 to S122 are repeated a statistical number of times (NO in S123). The statistical number has the following meaning. It is known from experience that the accuracy of the output (accuracy rate or error) varies with the variation in the teacher data used in the recognition test. The processor therefore repeats the learning step S100 a statistical number of times sufficient to suppress the variation in output accuracy appropriately. When the statistical number of learning steps S100 has been completed (YES in S123), the processor increments the learning cycle count as n = n + 1 and generates the median of the M accuracy rates obtained in the M learning steps as the accuracy rate An for learning cycle n (S131).

Step S131 corresponds to step S13 of FIG. 7 and FIG. 8 but differs from step S13 in that the accuracy rate An is the median of the M accuracy rates. The accuracy rate An may instead be, for example, a value obtained from the M accuracy rates by the least squares method.
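Taking the median in step S131 can be sketched with the standard library. The function name is hypothetical, and the example values are made up to show that a single outlying recognition test barely moves the result.

```python
import statistics


def cycle_accuracy(accuracies):
    """Representative accuracy rate An for one learning cycle: the median of the collected rates."""
    return statistics.median(accuracies)


# One outlier (0.20) among five measurements does not shift the median.
print(cycle_accuracy([0.71, 0.90, 0.74, 0.20, 0.73]))  # prints 0.73
```

The claims also allow the average in place of the median; `statistics.mean` would be the corresponding call, at the cost of more sensitivity to outliers.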

FIG. 10 shows the relationship between the accuracy rate and the error (loss function, LOSS) on the one hand and the number of learning iterations on the other. In the left-hand graphs the vertical axis is the accuracy rate and the horizontal axis is the number of learning iterations; in the right-hand graphs the vertical axis is the error (LOSS) and the horizontal axis is the number of learning iterations. Part (1) of FIG. 10 shows an example of the accuracy rate and the error (LOSS) varying from one learning iteration to the next (that is, from one recognition test included in the learning to the next). Part (2) of FIG. 10 shows the medians (black dots) of the accuracy rate and the error (LOSS) over a predetermined number of iterations.

As part (1) of FIG. 10 shows, the accuracy rate and the error (LOSS) vary widely from one learning iteration to the next, depending on the variation in the teacher data of the recognition test. In the learning step S100 of the third learning process, the processor therefore stores the median of the M accuracy rates, or of the M errors (LOSS), obtained in M learning iterations as the current accuracy rate An or error (LOSS).

Returning to FIG. 6, the learning process obtained by applying the learning processes of FIGS. 7, 8, and 9 of the present embodiment is described next. The vertical axis of FIG. 6 represents the error (LOSS), whereas FIGS. 7, 8, and 9 are described in terms of the accuracy rate An and the maximum accuracy rate Amax. The following description therefore uses the error (LOSS) E in place of the accuracy rate An, and the best value of the error (LOSS) in place of the maximum accuracy rate Amax.

First, at time t = 2 the processor stores the error (LOSS) of the DNN with weight W₂ as the best value (S22), and at the next time t = 3 it detects that the difference dE between the error (LOSS) of the DNN with weight W₃ and the best value exceeds the failure threshold K. The processor further detects that, from time t = 4 onward, the difference dE between the DNN error (LOSS) and the best value continues to exceed the failure threshold K. As a result, the processor detects a failure of learning (S23, S233), returns the DNN to the state Smax, that is, the parameters and the learning cycle n stored at time t = 2, decreases the learning rate η (S24, S24B), and resumes learning with the DNN in that state.

The next parameter W after the restart therefore lies to the left of the parameter W₃ in FIG. 6. If the gradient at that point is positive, the following parameter W moves in the negative direction and approaches the optimum W_min. Because the learning rate η remains small, the subsequent parameters W do not leave the valley around W_min, and the processor can find the optimum W_min.

In FIG. 6 the error (LOSS) deteriorates sharply (by dE) between time t = 2 and t = 3. Even when the error (LOSS) instead deteriorates gradually after time t = 2, the process of FIG. 8 still applies: as the error (LOSS) rises gradually from the minimum error (LOSS) at the best value W_min, the processor eventually detects that the difference dE from that minimum error exceeds the failure threshold K and that this state has occurred U times in succession, whereupon it returns the DNN to the state Smax, reduces the learning rate, and resumes DNN learning.
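When the error (LOSS) is used instead of the accuracy rate, smaller is better and the comparison is mirrored. The sketch below scans a sequence of per-cycle losses and reports the cycle at which a failure would be declared; the function name, the parameter `u_limit` (the required number of consecutive occurrences), and the example sequence are hypothetical.

```python
def detect_failure_by_loss(losses, k, u_limit):
    """Return the index of the cycle at which failure is declared, or None if it never is."""
    best = float("inf")   # best (smallest) loss seen so far
    u = 0                 # consecutive cycles with loss - best > K
    for i, e in enumerate(losses):
        if e < best:              # new best value: store it and clear the count
            best = e
            u = 0
        elif e - best > k:        # difference dE exceeds the failure threshold K
            u += 1
            if u >= u_limit:
                return i
        else:                     # worse than best, but within the threshold
            u = 0
    return None


# The loss creeps up gradually after its minimum at index 1.
gradual = [1.0, 0.5, 0.6, 0.8, 1.1, 1.2, 1.3]
print(detect_failure_by_loss(gradual, k=0.5, u_limit=2))  # prints 5
```

The gradual rise is not flagged while the loss stays within K of the best value; the failure is declared only once the excess has persisted for the required number of cycles.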

As described above, according to the present embodiment, when the processor detects a failure of DNN learning, it returns the DNN parameters to a saved state from several generations earlier, lowers the learning rate, and resumes DNN learning. The overall learning process can therefore be kept short even when a failure of learning occurs.

NN: neural network
DNN: deep neural network
An: accuracy rate
Amax: best accuracy rate
n: learning cycle
Smax: state
W: weight, DNN parameter
1: deep neural network device
20: DNN program
22: DNN parameters
24: DNN learning program
26: teacher data, evaluation data
E: error function, sum of squared differences, total of the sums of squared differences

Claims

An NN learning method for optimizing NN parameters of a neural network (hereinafter NN) using teacher data, the method comprising:
a learning step of updating the NN parameter to a second NN parameter obtained by subtracting, from a first NN parameter, a value obtained by multiplying the gradient of an error function between the output of the NN and a correct value by a learning rate, the output being obtained when the teacher data is input to the NN in which the first NN parameter is set;
an evaluation step of inputting evaluation data to the NN in which the second NN parameter is set and obtaining the accuracy of the output of the NN;
a step of storing the second NN parameter when the accuracy of the output of the NN is the best value; and
a step of returning the NN parameter to the stored NN parameter and lowering the learning rate when a first state is reached in which the accuracy of the output of the NN is not improved,
wherein, when the first state is reached, the learning step is resumed at the lowered learning rate with the NN in which the returned NN parameter is set.

The NN learning method according to claim 1, wherein the first state is a state in which the accuracy of the output of the NN has deteriorated from the best value by a reference value or more.

The learning method according to claim 1, wherein the first state is a state in which the accuracy of the output of the NN has been worse than the best value by a reference value or more throughout a first predetermined number of learning cycles.

The learning method according to claim 1, wherein the first state is a state in which the accuracy of the output of the NN has gradually deteriorated until it is worse than the best value by a reference value or more.

The learning method according to claim 1, wherein, in one learning cycle, the learning step and the evaluation step are repeated a second predetermined number of times, and the median or the average of the accuracies of the output of the NN over the second predetermined number of repetitions is used as the accuracy of the output for that learning cycle.

The learning method according to claim 1, wherein the accuracy of the output is an accuracy rate, which is the ratio of the number of times the output was correct to the number of times the evaluation step was executed.

The learning method according to claim 1, wherein the accuracy of the output is the error between the output and the correct value.

The learning method according to claim 1, wherein, in the step of storing the second NN parameter, the second NN parameter is also stored when the accuracy of the output of the NN is in the vicinity of the previous best value.

The learning method according to claim 1, wherein learning is ended when the accuracy of the output satisfies a learning end condition.

An NN learning program that causes a computer to execute a process of optimizing NN parameters of a neural network (hereinafter NN) using teacher data, the process comprising:
a learning step of updating the NN parameter to a second NN parameter obtained by subtracting, from a first NN parameter, a value obtained by multiplying the gradient of an error function between the output of the NN and a correct value by a learning rate, the output being obtained when the teacher data is input to the NN in which the first NN parameter is set;
an evaluation step of inputting evaluation data to the NN in which the second NN parameter is set and obtaining the accuracy of the output of the NN;
a step of storing the second NN parameter when the accuracy of the output of the NN is the best value; and
a step of returning the NN parameter to the stored NN parameter and lowering the learning rate when a first state is reached in which the accuracy of the output of the NN is not improved,
wherein, when the first state is reached, the learning step is resumed at the lowered learning rate with the NN in which the returned NN parameter is set.

An NN learning device that optimizes NN parameters of a neural network (hereinafter NN) using teacher data, the device comprising:
a memory; and
a processor connected to the memory,
wherein the processor executes:
a learning step of updating the NN parameter to a second NN parameter obtained by subtracting, from a first NN parameter, a value obtained by multiplying the gradient of an error function between the output of the NN and a correct value by a learning rate, the output being obtained when the teacher data is input to the NN in which the first NN parameter is set;
an evaluation step of inputting evaluation data to the NN in which the second NN parameter is set and obtaining the accuracy of the output of the NN;
a step of storing the second NN parameter when the accuracy of the output of the NN is the best value; and
a step of returning the NN parameter to the stored NN parameter and lowering the learning rate when a first state is reached in which the accuracy of the output of the NN is not improved,
and wherein, when the first state is reached, the learning step is resumed at the lowered learning rate with the NN in which the returned NN parameter is set.