JP2007128318A

JP2007128318A - State estimation method, state estimation device, state estimation system and computer program

Info

Publication number: JP2007128318A
Application number: JP2005320988A
Authority: JP
Inventors: Jun Morimoto (森本 淳); Kenji Doya (銅谷 賢治)
Original assignee: Japan Science and Technology Agency; ATR Advanced Telecommunications Research Institute International
Current assignee: Japan Science and Technology Agency; ATR Advanced Telecommunications Research Institute International
Priority date: 2005-11-04
Filing date: 2005-11-04
Publication date: 2007-05-24
Anticipated expiration: 2025-11-04
Also published as: JP4811997B2

Abstract

PROBLEM TO BE SOLVED: To provide a state estimation method, a state estimation device, a state estimation system, and a computer program that can be applied easily not only to linear dynamics but also to nonlinear dynamics.

SOLUTION: A (reinforcement learning) state estimation device 3 that performs state estimation by reinforcement learning uses program modules and functions such as: a simulation model 3a that simulates the state of an observation target 1; a reinforcement learning module 3b that calculates a feedback value representing the state estimation policy based on, among other things, the simulation model 3a's estimate of the state of the observation target 1; and a reward function 3d that calculates a reward value based on, among other things, the observation result of the observation target 1 and the feedback value calculated by the reinforcement learning module 3b. The reinforcement learning module 3b performs reinforcement learning based on the reward value so that the feedback value becomes appropriate.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a state estimation method for estimating the state of an observation target based on observation values obtained by observing that state, a state estimation device to which the method is applied, a state estimation system using the device, and a computer program for realizing the device. In particular, it relates to a state estimation method, a state estimation device, a state estimation system, and a computer program that estimate the state of an observation target exhibiting linear or nonlinear behavior by observing that behavior.

When controlling a controlled object such as a robot, it may be impossible to measure the required state variables directly, either because the various sensors built into the controlled object are affected by noise or because a sensor cannot be built in at all. Likewise, in market prediction, for example, variables such as stock prices that drive price fluctuations cannot be observed directly.

When state variables cannot be observed directly, methods such as a state observer (see Non-Patent Document 1) or a Kalman filter (see Non-Patent Documents 2 and 3) are generally used.

Non-Patent Document 1: D. G. Luenberger, "An introduction to observers," IEEE Trans. AC, Vol. 16, pp. 596-602, 1971.
Non-Patent Document 2: R. E. Kalman and R. S. Bucy, "New results in linear filtering and prediction theory," Trans. ASME, Series D, J. of Basic Engineering, Vol. 83, No. 1, pp. 95-108, 1961.
Non-Patent Document 3: F. L. Lewis, Optimal Estimation: with an Introduction to Stochastic Control Theory, John Wiley & Sons, 1977.

However, the state observer of Non-Patent Document 1 and the Kalman filter of Non-Patent Documents 2 and 3 have the problem that estimating the hidden state becomes difficult when the dynamics of the target are nonlinear.

The present invention has been made in view of these circumstances. It uses program modules and functions such as a simulation model that simulates the state of the observation target, a reinforcement learning module that calculates a feedback value representing the state estimation policy based on, among other things, the simulation model's estimate of the state of the observation target, and a reward function that calculates a reward value based on, among other things, the feedback value calculated by the reinforcement learning module and the observation result of the observation target. The reinforcement learning module performs reinforcement learning based on the reward value so that the feedback value becomes appropriate. The object of the invention is thereby to provide a state estimation method that can be applied easily not only to linear dynamics but also to nonlinear dynamics, a state estimation device to which the method is applied, a state estimation system using the device, and a computer program for realizing the device.

The state estimation method according to the first invention is a state estimation method using a state estimation device that estimates the state of an observation target based on an observation result obtained by observing that state, characterized by: estimating the state of the observation target using a simulation model that simulates that state; calculating, based on the state estimated by the simulation model, an estimated observation result, that is, an estimate of the observation result; calculating a feedback value based on the state estimation policy of a reinforcement learning module, using the estimated observation result, the observation result, and the difference between them; calculating a reward value based on the difference between the estimated observation result and the observation result and on the feedback value; updating the policy of the reinforcement learning module using the calculated reward value; and estimating the state of the observation target with the simulation model based on the feedback value calculated by the reinforcement learning module.

In the present invention, the reinforcement learning module performs reinforcement learning based on the reward value so that the feedback value becomes appropriate, which makes the method easy to apply not only to linear dynamics but also to nonlinear dynamics.

The state estimation device according to the second invention is a state estimation device that estimates the state of an observation target based on an observation result obtained by observing that state, characterized by comprising: a simulation model that estimates the state of the observation target; means for calculating, based on the state estimated by the simulation model, an estimated observation result, that is, an estimate of the observation result; a reinforcement learning module that calculates a feedback value based on its state estimation policy, using the estimated observation result, the observation result, and the difference between them; means for calculating a reward value based on the difference between the estimated observation result and the observation result and on the feedback value; means for updating the policy of the reinforcement learning module using the calculated reward value; and means for estimating the state of the observation target with the simulation model based on the feedback value calculated by the reinforcement learning module.

The state estimation device according to the third invention is characterized in that, in the second invention, the reinforcement learning module is configured to calculate a feedback value distributed according to a normal distribution whose mean and standard deviation are based on learning parameters that are updated according to the reward value.

In the present invention, the feedback value representing the state estimation policy is spread according to a distribution function based on the learning parameters so that various policies are tried. Reinforcement learning is repeated while the mean and standard deviation of the distribution function are updated so that the estimated observation result of the simulation model approaches the observation result of the observation target. The feedback value therefore converges toward values that allow the actual state of the observation target to be estimated accurately, and the approach can be extended easily to nonlinear dynamics.

The state estimation device according to the fourth invention is characterized in that, in the third invention, the reinforcement learning module is configured to update the learning parameters based on a function that uses a moving average of the reward value.

In the present invention, updating the learning parameters by reinforcement learning reduces the estimation error over the whole period in which the task is performed, rather than the estimation error at a single instant, so that the state estimated by the simulation model and the reinforcement learning module can be brought appropriately close to the state of the observation target.

The state estimation system according to the fifth invention comprises an observation target, the state estimation device according to any one of the second to fourth inventions for estimating the state of the observation target, and a control device for controlling the observation target, and is characterized in that: the simulation model of the state estimation device further comprises means for outputting the estimation result of the state of the observation target to the control device; the control device comprises means for generating, based on the received estimation result, a control command for controlling the observation target, and means for outputting the generated control command to the observation target; and the observation target comprises means for operating in accordance with the received control command.

In the present invention, the reinforcement learning module performs reinforcement learning based on the reward value so that the feedback value becomes appropriate, which makes the system easy to apply not only to linear dynamics but also to nonlinear dynamics. Moreover, because the observation target is controlled based on the estimation result produced by the simulation model from the learning result of the reinforcement learning module, the system can be extended to various fields such as robot control.

The computer program according to the sixth invention is a computer program that causes a computer, which receives as input an observation result obtained by observing the state of an observation target, to estimate the state of the control target based on the received observation result, characterized by causing the computer to execute: a procedure of estimating the state of the observation target using a simulation model that simulates that state; a procedure of calculating, based on the state estimated by the simulation model, an estimated observation result, that is, an estimate of the observation result; a procedure of calculating a feedback value based on the state estimation policy of a reinforcement learning module, using the estimated observation result, the observation result, and the difference between them; a procedure of calculating a reward value based on the difference between the estimated observation result and the observation result and on the feedback value; a procedure of updating the policy of the reinforcement learning module using the calculated reward value; and a procedure of estimating the state of the observation target with the simulation model based on the feedback value calculated by the reinforcement learning module.

In the present invention, when the program is executed on a computer such as a general-purpose computer, the computer operates as a state estimation device, and the reinforcement learning module performs reinforcement learning based on the reward value so that the feedback value becomes appropriate. This makes the program easy to apply not only to linear dynamics but also to nonlinear dynamics.

The state estimation method, state estimation device, state estimation system, and computer program according to the present invention involve an observation target, such as a device, a natural phenomenon, or an economic phenomenon, that exhibits linear or nonlinear behavior and whose state can be observed in part or in whole, and a state estimation device that estimates the state of the observation target based on an observation result obtained by observing that state. In particular, when the observation target is one of various controllable devices, a control device that controls the observation target is also provided. The state of the observation target is estimated using a simulation model that simulates that state; an estimated observation result, that is, an estimate of the observation result, is calculated based on the state estimated by the simulation model; a feedback value representing the state estimation policy is calculated based on a reward value that evaluates the estimation result using the estimated observation result, the observation result, and the difference between them; a reward value is calculated based on the difference between the estimated observation result and the observation result and on the feedback value; and the way the simulation model estimates the state of the observation target is updated based on the feedback value.

With this configuration, the reinforcement learning module performs reinforcement learning based on the reward value so that the feedback value becomes appropriate, which yields excellent effects such as easy application not only to linear dynamics but also to nonlinear dynamics.

Furthermore, the reinforcement learning module is configured to calculate a feedback value distributed according to a normal distribution whose mean and standard deviation are based on learning parameters that are updated according to the reward value, and to update the learning parameters based on a function that uses a moving average of the reward value.

With this configuration, because the learning parameters are updated by reinforcement learning, the present invention reduces the estimation error over the whole period in which the task is performed, rather than the estimation error at a single instant. This yields excellent effects such as bringing the state estimated by the simulation model and the reinforcement learning module appropriately close to the state of the observation target.

The present invention is described in detail below with reference to the drawings showing its embodiment. FIG. 1 is a block diagram conceptually showing a configuration example of the state estimation method of the present invention. In FIG. 1, reference numeral 1 denotes an observation target (Plant); the purpose of the state estimation method of the present invention is to estimate the state of the observation target 1. The observation target 1 is a device, a natural phenomenon, an economic phenomenon, or the like that exhibits linear or nonlinear behavior, and its state can be observed in part or in whole. In this embodiment, the observation target 1 is described as a device that can be controlled by a control device (Controller) 2. The state of the observation target 1 is estimated by a reinforcement learning state estimation device (RLSE: Reinforcement Learning State Estimator) 3.

The reinforcement learning state estimation device 3 is implemented, for example, on a general-purpose computer. It has a module that functions as a simulation model (Plant Model) 3a for estimating the state of the observation target 1, and a reinforcement learning module (Reinforcement Learning Module) 3b that outputs a policy for state estimation so that the simulation model 3a estimates the state accurately. The reinforcement learning module 3b is the program module that performs the main processing of the state estimation method of the present invention. It performs reinforcement learning using the various functions, constants, and other settings described later, and develops a policy that brings the behavior of the simulation model 3a closer to that of the observation target 1.

The reinforcement learning state estimation device 3 also has various functions, such as an output function 3c that calculates the estimated observation result (an estimate of the observation result) from the state estimated by the simulation model 3a, and a reward function 3d that calculates a reward value evaluating the estimation result of the simulation model 3a.

In the state estimation method of the present invention configured in this way, the observation target 1 operates under control based on control commands from the control device 2, and the observation result obtained by observing the state of the observation target 1 is input to the reinforcement learning state estimation device 3.

The simulation model 3a of the reinforcement learning state estimation device 3 estimates the state of the observation target 1 according to the model configured for estimation and outputs the result to the control device 2 as the estimation result for the observation target 1. The simulation model 3a estimates the state based on the feedback value output by the state estimation policy of the reinforcement learning module 3b. The control device 2 generates a control command for controlling the observation target 1 based on the estimation result it receives, and outputs the generated control command to the observation target 1 and to the reinforcement learning state estimation device 3.

The estimation result output from the simulation model 3a is also used by the output function 3c and the reinforcement learning module 3b inside the reinforcement learning state estimation device 3. From the estimation result, which represents the state of the observation target as estimated by the simulation model 3a, the device uses the output function 3c to calculate the estimated observation result, an estimate of the observation result of the observation target 1, and passes it to the reward function 3d.

Based on the observation result of the observation target 1, the estimated observation result calculated by the output function 3c, and the feedback value output by the state estimation policy of the reinforcement learning module 3b, the reward function 3d calculates a reward value that evaluates the estimation result, and passes the calculated reward value to the reinforcement learning module 3b.

The reinforcement learning module 3b updates the state estimation policy based on the observation result of the observation target 1, the control command output from the control device 2, the estimation result from the simulation model 3a, and the reward value calculated by the reward function 3d. It then passes the feedback value output by the updated policy to the simulation model 3a and the reward function 3d.

FIG. 2 is a block diagram showing a configuration example of the reinforcement learning state estimation device 3 of the present invention. The reinforcement learning state estimation device 3, which uses a computer such as a general-purpose computer, comprises: control means 31, such as a CPU, that controls the entire device; auxiliary storage means 32, such as a CD-ROM drive, that reads various information from a recording medium 301, such as a CD-ROM, on which the computer program 300 for the reinforcement learning state estimation device of the present invention and various information such as data are recorded; recording means 33, such as a hard disk, that records the information read by the auxiliary storage means 32; and storage means 34, such as RAM, that stores data generated temporarily under the control of the control means 31. By loading the computer program 300 recorded in the recording means 33 into the storage means 34 and executing it under the control of the control means 31, the general-purpose computer operates as the reinforcement learning state estimation device 3 of the present invention. The device further comprises input means 35, which receives the observation result of the observation target 1 and the control command of the control device 2, and output means 36, which outputs the estimation result to the control device 2.

The recording means 33 of the reinforcement learning state estimation device 3 also stores the various program modules and functions, such as the simulation model 3a, the reinforcement learning module 3b, the output function 3c, and the reward function 3d.

Next, the state estimation method using the reinforcement learning state estimation device 3 of the present invention is described. The method is expressed by the following Equations 1 to 3, which use the true state x of the observation target 1 that is to be estimated and the observation result (output value) y of the observation target 1.

In Equation 1 above, the noise input n(t) has the property given by Equation 4 below.

In Equation 2 above, the observation noise v(t) has the property given by Equation 5 below.
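As a reading aid only, the surrounding description is consistent with a model of the following general shape. This is a sketch under stated assumptions, not a reconstruction of Equations 1 to 5: the continuous-time form, the observation function g, the additive placement of the feedback value a in the simulation model, and the zero-mean noise properties are all assumed for illustration.

```latex
\begin{aligned}
\dot{x}(t) &= f\bigl(x(t), u(t)\bigr) + n(t)
  && \text{(plant dynamics with noise input } n\text{)} \\
y(t) &= g\bigl(x(t)\bigr) + v(t)
  && \text{(observation with observation noise } v\text{)} \\
\dot{\hat{x}}(t) &= f\bigl(\hat{x}(t), u(t)\bigr) + a(t)
  && \text{(simulation model driven by the feedback value } a\text{)} \\
\mathbb{E}[n(t)] &= 0, \qquad \mathbb{E}[v(t)] = 0
  && \text{(assumed zero-mean noise)}
\end{aligned}
```

Here u is the control command, and \hat{x} corresponds to the estimated state written x^ in the text.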

In Equation 1 above, the dynamics f(x, u) of the observation target 1 is a function acquired by reinforcement learning, based on the observation result y of the observation target 1, so as to reduce the difference between the estimated state x^ of the observation target 1 and the true state x. (In the text that follows, the estimated state, or estimation result, of the observation target 1 is written x^.) Alternatively, known dynamics f(x, u) may be set in advance. In the state estimation method of the present invention, reinforcement learning is performed to bring the estimation result x^ closer to the actual state x by using the reward function of Equation 6 below. This reward function gives a higher evaluation the smaller the difference between the actual observation result y of the observation target 1 and the estimated observation result y^, which is calculated from the estimation result x^ of the state x as estimated by the simulation model 3a. (In the text that follows, the estimated observation result is written y^.)

The function e() in Equation 6, which evaluates the accuracy of the estimation result, is set to take a larger value the smaller the difference between the actual observation result y and the output of the output function 3c. The function c(a) is set to take a larger value the larger the magnitude of the policy output of the reinforcement learning module 3b.
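The two functions described here can be sketched as follows. This is a minimal illustration, not the patent's Equation 6: the quadratic forms, the `weight` parameter, and the choice to combine the terms as accuracy minus cost are assumptions, and the names `accuracy`, `feedback_cost`, and `reward` are hypothetical.

```python
def accuracy(y, y_hat):
    """e(.): larger (closer to zero) when the estimated observation is nearer y."""
    return -sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))


def feedback_cost(a, weight=0.1):
    """c(a): larger when the feedback value has larger magnitude."""
    return weight * sum(ai ** 2 for ai in a)


def reward(y, y_hat, a, weight=0.1):
    """Assumed combination: accuracy term minus the feedback cost."""
    return accuracy(y, y_hat) - feedback_cost(a, weight)


# A closer estimate earns a higher reward; a larger feedback value lowers it.
close = reward([1.0], [0.9], [0.0])
far = reward([1.0], [0.0], [0.0])
costly = reward([1.0], [0.9], [2.0])
```

Under these assumptions the cost term discourages the policy from applying large corrections when small ones achieve the same observation error.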

By repeating the learning process based on the reward function r of Equation 6, the reinforcement learning module 3b updates the state estimation policy for the simulation model 3a, calculates the feedback value a from the updated policy, and outputs it to the simulation model 3a. Starting from an unknown state, the feedback value a representing the state estimation policy is spread according to the probability distribution of Equation 7 below so that various policies are tried, and it converges through repeated reinforcement learning so that the estimated observation result y^ of the simulation model 3a approaches the observation result y of the observation target 1.

As Equation 7 shows, the feedback value a follows a normal distribution with mean μ and variance σ²; the smaller the variance σ², the smaller the variation in the policy.
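The effect of the variance on the spread of the feedback value can be illustrated with a short sketch. It assumes a scalar feedback value and uses Python's standard `random.gauss`; the function names `sample_feedback` and `spread` and the numeric values are hypothetical.

```python
import random


def sample_feedback(mu, sigma, rng=random):
    """Draw a feedback value a from a normal distribution N(mu, sigma^2)."""
    return rng.gauss(mu, sigma)


def spread(samples):
    """Sample standard deviation, used to compare how much the policy varies."""
    m = sum(samples) / len(samples)
    return (sum((s - m) ** 2 for s in samples) / (len(samples) - 1)) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
wide = [sample_feedback(0.5, 1.0, rng) for _ in range(10000)]
narrow = [sample_feedback(0.5, 0.1, rng) for _ in range(10000)]

# The policy with the smaller standard deviation explores less around its mean.
print(spread(wide) > spread(narrow))
```

A large σ lets the module try many different feedback values early in learning; shrinking σ concentrates the feedback value around the learned mean μ.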

The mean μ in Equation 7 is given by Equations 8 to 10 below, and learning proceeds by updating the parameters to be learned.

The standard deviation σ, the square root of the variance σ² in Equation 7, is given by Equation 11 below, and learning proceeds by updating the parameters to be learned.

In reinforcement learning, the value function Q given by Equation 12 below is used to update the parameters of Equations 7 and 8.

Next, the method of updating the parameters to be learned is described. Based on the reward value R(t) calculated by the reward function 3d, the reinforcement learning module 3b of the reinforcement learning state estimation device 3 calculates the most recent moving average, using a preset time constant τ, according to Equation 15 below.
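One common way to compute a moving average governed by a time constant is a first-order low-pass filter, sketched below. The filter form, the Euler step size `dt`, and the function name `update_average_reward` are assumptions for illustration; the actual form of Equation 15 may differ.

```python
def update_average_reward(r_bar, r, dt, tau):
    """One Euler step of tau * d(r_bar)/dt = r - r_bar."""
    return r_bar + (dt / tau) * (r - r_bar)


# With a constant reward, the moving average settles at that reward.
r_bar = 0.0
for _ in range(1000):
    r_bar = update_average_reward(r_bar, r=1.0, dt=0.01, tau=0.1)
```

A larger τ makes the average respond more slowly, so it reflects reward over a longer recent window.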

The approximation error Δ(t) of the average reward is calculated by Equation 16 below.

The first term on the right side of Equation 16 is the difference between the reward r and the average of the reward value R(t) given by Equation 15. The second term is the time derivative of the value function Q(t), written as Q with a dot above it and the argument (t). The larger this time derivative and the larger the approximation error Δ(t), the more correct the current state estimation policy is. The method for updating the parameters of the value function Q so as to correct the approximation error Δ(t) is shown in Equations 17 and 18 below.
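Equation 16 is not reproduced in this text, so the following is only a sketch inferred from the verbal description of its two terms: the reward minus its moving average, plus the time derivative of Q. The additive form, the finite-difference approximation of the derivative, and the function name `approximation_error` are all assumptions.

```python
def approximation_error(reward, avg_reward, q_now, q_prev, dt):
    """Assumed form: (reward - average reward) + time derivative of Q."""
    q_dot = (q_now - q_prev) / dt  # finite-difference estimate of dQ/dt
    return (reward - avg_reward) + q_dot

# Reward above its average with an unchanged value function.
print(approximation_error(1.0, 0.5, 2.0, 2.0, 0.02))
```

A positive result means the outcome was better than the value function predicted, which is the signal used to adjust the parameters.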

The time-weighted averages of the outputs of the functions shown in Equations 17 and 18 are updated using Equations 19 and 20 below.

Then, using the outputs of the functions shown in Equations 13 and 14, the parameters to be learned are updated by Equations 21 and 22 below.

FIG. 3 is a flowchart showing the processing of the reinforcement learning state estimation device 3 of the present invention. As described above, under the control of the control unit 31 executing the computer program 300, the reinforcement learning state estimation device 3 uses the reinforcement learning module 3b, with the functions and equations described above, to calculate a feedback value based on the state estimation policy (step S1); estimates the state of the observation target with the simulation model 3a using Equation 1 (step S2); calculates, with the output function 3c, an estimated observation result from the state estimated by the simulation model 3a (step S3); calculates a reward value using Equation 6 from the difference between the estimated observation result and the observation result and from the feedback value calculated by the reinforcement learning module 3b (step S4); and updates the policy of the reinforcement learning module 3b using the calculated reward value (step S5). Then, under the control of the control unit 31, the reinforcement learning state estimation device 3 estimates the state of the observation target with the simulation model 3a based on the feedback value calculated by the reinforcement learning module 3b (step S6).
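The order of steps S1 to S6 can be sketched as a simple loop. This is a skeleton under stated assumptions, not the device's implementation: every callable below (`policy`, `model_step`, `output_fn`, `reward_fn`, `update_policy`) is a hypothetical stand-in, the policy is a fixed-gain feedback rather than a learned one, and steps S2 and S6 are merged into a single model update driven by the feedback value.

```python
import math

def run_estimation_loop(observations, x_hat, policy, model_step,
                        output_fn, reward_fn, update_policy):
    """Run the S1-S6 cycle once per observation and return the state estimates."""
    estimates = []
    y_hat = output_fn(x_hat)              # initial estimated observation
    for y in observations:
        a = policy(y, y_hat)              # S1: feedback value from the policy
        x_hat = model_step(x_hat, a)      # S2/S6: model estimate driven by the feedback
        y_hat = output_fn(x_hat)          # S3: estimated observation result
        r = reward_fn(y, y_hat, a)        # S4: reward from the error and the feedback
        update_policy(r)                  # S5: policy update (stand-in below)
        estimates.append(x_hat)
    return estimates

# Hypothetical stand-ins, not the learned components of the device.
rewards = []

def policy(y, y_hat):
    return 0.5 * (y - y_hat)              # fixed-gain feedback instead of a learned policy

def model_step(x_hat, a):
    return x_hat + a                      # constant-state model corrected by the feedback

def output_fn(x_hat):
    return x_hat                          # the state is observed directly

def reward_fn(y, y_hat, a):
    # Illustrative reward: high when the error is small, with a small feedback cost.
    return math.exp(-(y - y_hat) ** 2) - 0.01 * a ** 2

def update_policy(r):
    rewards.append(r)                     # only records the reward; no learning here

estimates = run_estimation_loop([1.0] * 50, 0.0, policy, model_step,
                                output_fn, reward_fn, update_policy)
print(estimates[-1], rewards[0], rewards[-1])
```

In this toy setting the estimate moves toward the constant observed value and the reward rises as the observation error shrinks; in the device described above, step S5 would instead adjust the policy parameters.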

Next, the results of an experiment performed using the state estimation method of the present invention are described. FIG. 4 is a schematic diagram showing the simple pendulum used as the observation target in the experiment. The weight of the pendulum has mass m and the pendulum has length l; applying an external force T starts it swinging about the fulcrum o. The dynamic model of the simple pendulum shown in FIG. 4 is given by Equation 23 below.

In this experiment, μ = 0.01, m = 1.0 kg, l = 1.0 m, and g = 9.8 m/s².
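Equation 23 is not reproduced in this text. As a rough illustration, the sketch below simulates a commonly used damped pendulum model, m·l²·θ'' = −μ·θ' − m·g·l·sin θ + T, with the parameter values listed above. The equation form, the sign convention (θ measured from the hanging position), the role of μ as a friction coefficient, the integration step, and the function name `pendulum_step` are all assumptions.

```python
import math

# Parameter values reported for the experiment.
MU, M, L, G = 0.01, 1.0, 1.0, 9.8

def pendulum_step(theta, omega, torque, dt):
    """Advance the assumed pendulum model by one semi-implicit Euler step."""
    alpha = (-MU * omega - M * G * L * math.sin(theta) + torque) / (M * L ** 2)
    omega = omega + alpha * dt            # update angular velocity first
    theta = theta + omega * dt            # then the angle, using the new velocity
    return theta, omega

# Release the pendulum from 0.5 rad with no external torque and simulate 1 second.
theta, omega = 0.5, 0.0
for _ in range(1000):
    theta, omega = pendulum_step(theta, omega, torque=0.0, dt=0.001)
print(theta, omega)
```

The sine term makes the motion nonlinear in θ, which is why the observed values in the experiment vary nonlinearly over time.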

Here, letting the state vector representing the state of the observed pendulum be x = (θ, ω)^T and letting u = T, Equations 1 and 2 described above are expressed as Equations 24 and 25 below.

In Equations 24 and 25, the noise n(t) and the noise v(t) can be expressed by substituting U = diag{0.01, 0.01} and S = 1.0 into Equations 4 and 5 described above.

The estimated state of the observation target in Equation 3 described above is expressed in this experiment by Equation 26 below.

In Equation 26, the third term on the right side is the feedback value for the state estimation.

The reward function r shown in Equation 6 described above is expressed in this experiment by Equation 27 below.

In this experiment, σr = 0.5. In Equation 27, the function c(a_j), which depends on the accuracy of knowledge, is given by Equation 28 below, using the maximum values of the feedback value a, a^max = (a₁^max, a₂^max) = (5.0, 5.0).

In this experiment, the learning rates were set to the values given by Equations 29 and 30 below.

In addition, the time constant was set to τ = 0.2 sec and the eligibility trace time constant to κ = 0.2 sec.

The feedback value is updated by the reinforcement learning at a period of 0.02 sec.

FIGS. 5 to 7 are graphs showing experimental results of the state estimation method of the present invention. In FIGS. 5 and 6, (a) shows the result when only the angle θ is observed, (b) the result when only the angular velocity ω is observed, and (c) the result when only the linear sum θ + ω of the angle and the angular velocity is observed. FIG. 5 shows the relationship between time and the observed value in the experiment under the conditions described above: FIG. 5(a) shows the change over time of the observed angle θ, FIG. 5(b) that of the observed angular velocity ω, and FIG. 5(c) that of the observed value θ + ω. In each panel of FIG. 5, the horizontal axis is time and the vertical axis is the observed value. As shown in FIG. 5, all of the observed values vary nonlinearly.

FIG. 6 shows the change over time of the actual angle and the estimated angle, with time on the horizontal axis and the angle θ on the vertical axis. FIG. 6(a) compares the actual angle with the angle estimated from the observed angle shown in FIG. 5(a); in this experiment the estimated angle matches the actual angle after about 1 second. FIG. 6(b) compares the actual angle with the angle estimated from the observed angular velocity shown in FIG. 5(b); the estimated angle matches the actual angle after about 3 seconds. FIG. 6(c) compares the actual angle with the angle estimated from the observed sum of the angle and the angular velocity shown in FIG. 5(c); the estimated angle matches the actual angle after about 2 seconds.

FIG. 7 shows the change over time of the actual angular velocity and the estimated angular velocity, with time on the horizontal axis and the angular velocity ω on the vertical axis. FIG. 7(a) compares the actual angular velocity with the angular velocity estimated from the observed angle shown in FIG. 5(a); in this experiment the estimated angular velocity matches the actual angular velocity after about 2 seconds. FIG. 7(b) compares the actual angular velocity with the angular velocity estimated from the observed angular velocity shown in FIG. 5(b); the estimated angular velocity matches the actual angular velocity after about 2 seconds. FIG. 7(c) compares the actual angular velocity with the angular velocity estimated from the observed sum of the angle and the angular velocity shown in FIG. 5(c); the estimated angular velocity matches the actual angular velocity after about 2 seconds.

As shown above, the present invention makes it possible to easily estimate the state of nonlinear dynamics.

The embodiment above describes an experiment with a simple pendulum as the observation target, but the present invention is not limited to this. It can be applied to estimating the state of observation targets that exhibit linear or nonlinear behavior, such as various devices, natural phenomena, and economic phenomena.

FIG. 1 is a block diagram conceptually showing a configuration example of the state estimation method of the present invention.
FIG. 2 is a block diagram showing a configuration example of the reinforcement learning state estimation device of the present invention.
FIG. 3 is a flowchart showing the processing of the reinforcement learning state estimation device of the present invention.
FIG. 4 is a schematic diagram showing the simple pendulum used as the observation target in the experiment of the state estimation method of the present invention.
FIG. 5 is a graph showing experimental results of the state estimation method of the present invention.
FIG. 6 is a graph showing experimental results of the state estimation method of the present invention.
FIG. 7 is a graph showing experimental results of the state estimation method of the present invention.

Explanation of symbols

1 Observation target
2 Control device
3 Reinforcement learning state estimation device
3a Simulation model
3b Reinforcement learning module
3c Output function
3d Reward function
300 Computer program
301 Recording medium

Claims

A state estimation method using a state estimation device that estimates the state of an observation target based on an observation result obtained by observing the state of the observation target, the method comprising:
estimating the state of the observation target using a simulation model that simulates the state of the observation target;
calculating, based on the state estimated by the simulation model, an estimated observation result that estimates the observation result;
calculating a feedback value based on a state estimation policy in a reinforcement learning module, using the estimated observation result and the observation result, and the difference between the estimated observation result and the observation result;
calculating a reward value based on the difference between the estimated observation result and the observation result and on the feedback value;
updating the policy of the reinforcement learning module using the calculated reward value; and
estimating the state of the observation target with the simulation model based on the feedback value calculated by the reinforcement learning module.

A state estimation device that estimates the state of an observation target based on an observation result obtained by observing the state of the observation target, the device comprising:
a simulation model that estimates the state of the observation target;
means for calculating, based on the state estimated by the simulation model, an estimated observation result that estimates the observation result;
a reinforcement learning module that calculates a feedback value based on a state estimation policy in the reinforcement learning module, using the estimated observation result and the observation result, and the difference between the estimated observation result and the observation result;
means for calculating a reward value based on the difference between the estimated observation result and the observation result and on the feedback value;
means for updating the policy of the reinforcement learning module using the calculated reward value; and
means for estimating the state of the observation target with the simulation model based on the feedback value calculated by the reinforcement learning module.

The state estimation device according to claim 2, wherein the reinforcement learning module is configured to calculate a feedback value distributed according to a normal distribution defined by a mean and a standard deviation that are based on learning parameters updated according to the reward value.

The state estimation device according to claim 3, wherein the reinforcement learning module is configured to update the learning parameters based on a function that uses a moving average of the reward value.

A state estimation system comprising:
an observation target;
the state estimation device according to any one of claims 2 to 4, which estimates the state of the observation target; and
a control device that controls the observation target,
wherein the simulation model of the state estimation device further comprises means for outputting the estimation result of the state of the observation target to the control device,
the control device comprises:
means for generating, based on the received estimation result, a control command for controlling the observation target; and
means for outputting the generated control command to the observation target, and
the observation target comprises means for operating in accordance with the received control command.

A computer program that causes a computer, which receives input of an observation result obtained by observing the state of an observation target, to estimate the state of the control target based on the received observation result, the computer program causing the computer to execute:
a procedure of estimating the state of the observation target using a simulation model that simulates the state of the observation target;
a procedure of calculating, based on the state estimated by the simulation model, an estimated observation result that estimates the observation result;
a procedure of calculating a feedback value based on a state estimation policy in a reinforcement learning module, using the estimated observation result and the observation result, and the difference between the estimated observation result and the observation result;
a procedure of calculating a reward value based on the difference between the estimated observation result and the observation result and on the feedback value;
a procedure of updating the policy of the reinforcement learning module using the calculated reward value; and
a procedure of estimating the state of the observation target with the simulation model based on the feedback value calculated by the reinforcement learning module.