JP6872226B2

JP6872226B2 - Decision maker

Info

Publication number: JP6872226B2
Application number: JP2017016294A
Authority: JP
Inventors: 敬志土屋; 寺部　一弥; 一弥寺部; 徹鶴岡; 成主金; 青野　正和; 正和青野
Original assignee: National Institute for Materials Science
Current assignee: National Institute for Materials Science
Priority date: 2017-01-31
Filing date: 2017-01-31
Publication date: 2021-05-19
Anticipated expiration: 2037-01-31
Also published as: JP2018124790A

Description

The present invention relates to a decision-making device that, when event information is supplied as an electrical signal, selects the action with the higher reward probability.

In recent years, efficient decision-making has become increasingly important. In finance, for example, risky assets must be managed safely on the basis of market information that fluctuates from moment to moment. In cognitive radio, the optimal radio scheme and frequency band must be selected according to the location of the terminal and the time of day. Games such as Go and shogi are typical examples in which decision-making in a changing environment is the central problem, and matches between humans and computers have attracted attention in recent years.

Such problems are treated as multi-armed bandit problems and are usually solved computationally with conventional algorithms such as the softmax method and the ε-greedy method. These methods are not universally effective, however, and faster and more accurate solutions are needed.
The "tug-of-war principle" has recently been proposed as an efficient solution to the multi-armed bandit problem (Non-Patent Documents 1 to 3 and Patent Document 1). For example, when choosing between two actions with different reward probabilities, an object that is displaced (as in a tug of war) according to the rewards obtained through trial and error on each action is used to select the action with the higher reward probability. This selection is referred to as decision-making.
Referring to FIG. 1, consider choosing between action A, with a reward probability of 80%, and action B, with a reward probability of 20%. Because the reward probabilities of actions A and B are unknown to the player, the player estimates them from the experience of selecting each action and either receiving or not receiving a reward, and then selects (decides on) the action with the higher reward probability. Under the tug-of-war principle, the player selects action A or B and displaces the object step by step according to the reward obtained, and in this way selects the action with the higher reward probability. For example, when action A is selected during trial and error, a displacement of +1 is applied to the object if a reward is obtained and a displacement of −ω if it is not. Conversely, when action B is selected, a displacement of −1 is applied if a reward is obtained and +ω if it is not. The decision is regarded as made once the displacement of the object becomes biased to one side. Here ω is defined as ω = γ/(2 − γ). In the case of FIG. 1, γ is the sum of the reward probability of action A (80%) and that of action B (20%) divided by 100, that is, γ = 1.0 (Non-Patent Document 1).
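The displacement rule described above can be illustrated with a short simulation. The following Python sketch is not part of the patent. The function names `omega` and `tug_of_war`, the rule of choosing the action from the sign of the displacement, the random tie-break at zero, and the use of the true probabilities to compute ω are all assumptions made for illustration.

```python
import random


def omega(p_a: float, p_b: float) -> float:
    """Penalty weight w = g / (2 - g), where g = P_A + P_B (probabilities in 0..1)."""
    gamma = p_a + p_b
    return gamma / (2.0 - gamma)


def tug_of_war(p_a: float, p_b: float, trials: int = 1000, seed: int = 0) -> float:
    """Return the final displacement of the object after `trials` trials.

    Assumption: a positive displacement means action A is chosen next,
    a negative displacement means action B, and zero is a coin flip.
    """
    rng = random.Random(seed)
    w = omega(p_a, p_b)
    x = 0.0
    for _ in range(trials):
        if x > 0:
            choose_a = True
        elif x < 0:
            choose_a = False
        else:
            choose_a = rng.random() < 0.5
        rewarded = rng.random() < (p_a if choose_a else p_b)
        if choose_a:
            x += 1.0 if rewarded else -w   # action A: +1 on reward, -w otherwise
        else:
            x += -1.0 if rewarded else w   # action B: -1 on reward, +w otherwise
    return x


if __name__ == "__main__":
    print(tug_of_war(0.8, 0.2))  # drifts to the positive (action A) side
    print(tug_of_war(0.2, 0.8))  # drifts to the negative (action B) side
```

With (P_A, P_B) = (0.8, 0.2), ω equals 1, so each trial moves the object by +1 with probability 0.8 and by −1 with probability 0.2 whichever action is chosen, and the displacement drifts toward the action A side.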

Compared with conventional methods, the tug-of-war principle not only converges faster on the action with the higher reward probability but also adapts well to changes in the environment (the reward probability of each action). Furthermore, whereas other solutions are programs that rely on computation, the tug-of-war principle relies on a physical phenomenon, so it avoids the growth in computational load that troubles programs and the resulting limit on the number of operations that can be processed.

Attempts have been made to implement tug-of-war decision-making with various physical phenomena and to apply it to reinforcement learning (Non-Patent Documents 4 to 7). For example, when a nitrogen defect in nanodiamond is used as a photon source, the tug-of-war principle can be implemented physically by exploiting the particle nature and probabilistic behavior of single photons (Non-Patent Document 6). However, this approach requires a large-scale optical circuit and is therefore unsuitable for miniaturizing devices and circuits. There has also been an attempt to form and rupture metal filaments in a relatively small space and use them for decision-making (Non-Patent Document 7). This method, however, has the fundamental problem that it cannot in principle reproduce the tug-of-war principle accurately, and it can hardly be called practical. Thus, the problem of providing a decision-making device that is accurately based on the tug-of-war principle and uses a device that can be miniaturized remains unsolved.

Patent Document 1: Japanese Unexamined Patent Publication No. 2014-191598

Non-Patent Document 1: New J. Phys., vol. 17, p. 083023 (2015)
Non-Patent Document 2: Biosystems, vol. 101, pp. 29-36 (2010)
Non-Patent Document 3: LNCS, vol. 6079, pp. 69-80 (2010)
Non-Patent Document 4: Sci. Rep., vol. 3, p. 2370 (2013)
Non-Patent Document 5: J. Appl. Phys., vol. 116, p. 154303 (2014)
Non-Patent Document 6: AIMS Mater. Sci., vol. 3, pp. 245-259 (2016)
Non-Patent Document 7: DOI: 10.1039/c6nr00690f (2016)

An object of the present invention is to provide a decision-making device that can make decisions accurately based on the tug-of-war principle using a simple device that can be miniaturized.

The configurations of the present invention are as follows.
(Configuration 1)
A decision-making device comprising a learning means that learns by accumulating charge, a charge supply means that supplies the learning means with a charge corresponding to an action selected from among actions with different reward probabilities, and a voltage reading means that reads the voltage of the learning means, the device determining, from the voltage read by the voltage reading means, which action to select from among the actions with different reward probabilities,
wherein the learning means comprises an electrolyte element in which an electrolyte material layer capable of transporting ions under an electric field is sandwiched between two or more electrodes made of the same material.
(Configuration 2)
The decision-making device according to Configuration 1, wherein a current produced by the inflow of the charge is passed between the two or more electrodes to transport the ions, thereby generating a voltage between the electrodes.
(Configuration 3)
The decision-making device according to Configuration 1 or 2, wherein movement of the ions toward, or penetration of the ions into, at least one of the two or more electrodes generates electrons and holes in the two or more electrodes, thereby generating a voltage.
(Configuration 4)
The decision-making device according to any one of Configurations 1 to 3, wherein the electrolyte material layer contains a liquid electrolyte or a solid electrolyte.
(Configuration 5)
The decision-making device according to Configuration 4, wherein the liquid electrolyte contains at least one selected from the group consisting of tetramethylammonium ion (TMA⁺), tetraethylammonium ion (TEA⁺), tetrabutylammonium ion (TBA⁺), tetrafluoroborate ion (BF₄⁻), N,N-diethyl-N-methyl-N-(2-methoxyethyl)ammonium bis(trifluoromethanesulfonyl)imide (DEME-TFSI), and N,N-diethyl-N-methyl-N-(2-methoxyethyl)ammonium tetrafluoroborate (DEME-BF₄).
(Configuration 6)
The decision-making device according to Configuration 4, wherein the electrolyte material contains a polymer compound having mobile ions.
(Configuration 7)
The decision-making device according to Configuration 6, wherein the polymer compound contains at least one of polyethylene oxide and Nafion.
(Configuration 8)
The decision-making device according to Configuration 4, wherein the electrolyte material layer contains at least one of a metal oxide having mobile ions and silicic acid (SiO₂).
(Configuration 9)
The decision-making device according to Configuration 8, wherein the metal oxide contains at least one selected from the group consisting of cerium oxide (CeO₂), tantalum oxide (Ta₂O₅), zirconium oxide (ZrO₂), niobium oxide (Nb₂O₅), tungsten oxide (WO₃), and lithium oxide (Li₂O).
(Configuration 10)
The decision-making device according to Configuration 4, wherein the two or more electrodes contain at least one of a metal and a semiconductor having electron conductivity.
(Configuration 11)
The decision-making device according to Configuration 10, wherein the metal contains at least one selected from the group consisting of gold, platinum, silver, palladium, aluminum, iron, copper, tungsten, titanium, and tantalum.
(Configuration 12)
The decision-making device according to Configuration 10, wherein the semiconductor contains at least one selected from the group consisting of carbon, silicon, and lithium cobalt oxide.
(Configuration 13)
The decision-making device according to Configuration 10, wherein the metal and the semiconductor contain an active material capable of chemically reacting with ions under an electric field.
(Configuration 14)
The decision-making device according to Configuration 10, wherein the metal and the semiconductor contain an electrolyte capable of transporting ions under an electric field, and ions in the electrolyte material layer and in one of the two or more electrodes move so that the ions penetrate into the other electrode.
(Configuration 15)
The decision-making device according to any one of Configurations 1 to 14, wherein the decision-making device has a wiring switching means.

According to the present invention, it is possible to provide a decision-making device that can make decisions accurately based on the tug-of-war principle using a simple device that can be miniaturized.

FIG. 1: Conceptual diagram explaining decision-making based on the tug-of-war principle.
FIG. 2: Block diagram showing the configuration of the decision-making device.
FIG. 3: Circuit diagram showing the configuration of a power switch section.
FIG. 4: Circuit diagram showing the configuration of a power switch section.
FIG. 5: Cross-sectional view showing the structure of the electrolyte element.
FIG. 6: Explanatory diagram showing the operating principle of the electrolyte element.
FIG. 7: Explanatory diagram of the electrical characteristics of the electrolyte element during learning and decision-making.
FIG. 8: Cross-sectional view showing the structure of an electrolyte element.
FIG. 9: Explanatory diagram showing the operating principle of the electrolyte element.
FIG. 10: Block diagram showing the configuration of a decision-making device.
FIG. 11: Block diagram showing the configuration of the learning storage device section.
FIG. 12: Explanatory diagram showing the operating principle of the learning storage device section.
FIG. 13: Explanatory diagram showing the operating principle of the learning storage device section.
FIG. 14: Characteristic diagram showing the change in the electromotive force of the electrolyte element when the reward probabilities (P_A, P_B) are (80%, 20%).
FIG. 15: Characteristic diagram showing the correct-answer probability over time when the reward probabilities (P_A, P_B) are switched repeatedly between (80%, 20%) and (20%, 80%).
FIG. 16: Characteristic diagram showing the correct-answer probability over time when the reward probabilities (P_A, P_B) are switched repeatedly between (70%, 30%) and (30%, 70%).
FIG. 17: Characteristic diagram showing the correct-answer probability over time when the reward probabilities (P_A, P_B) are switched repeatedly between (60%, 40%) and (40%, 60%).

Embodiments of the present invention are described below with reference to the drawings.
(Embodiment 1)
<Configuration of decision-making device>
The decision-making device of the present invention comprises a learning means that learns by accumulating charge, a charge supply means that supplies the learning means with a charge corresponding to the action taken for an event, and a voltage reading means that reads the voltage of the learning means. Its configuration is shown in FIG. 2.
Here, the learning means that learns by accumulating charge consists of an electrolyte element 11 in which an electrolyte material layer capable of ion transport under an electric field is sandwiched between two or more electrodes.
The charge supply means consists of power switches that supply charge from a power source based on input signals used to train on the actions taken for an event, and the voltage reading means consists of a voltmeter 14. The voltmeter preferably has a high resistance (high impedance) so that it affects the current flowing in the circuit as little as possible.
The power switches use a power source and an input signal to apply and remove a voltage and to set its polarity and magnitude. FIG. 2 shows the case in which the power switches consist of a first power switch 12, which can apply a first voltage to the electrolyte element 11 and remove it based on input signal 15, and a second power switch 13, which can apply and remove, based on input signal 16, a voltage of opposite polarity to that of the first power source. This is only an example, however. The power switch may instead be a switch that, based on input signals, applies a predetermined voltage of either polarity from a single power source to the electrolyte element 11 or interrupts the application of voltage.

An example of the power switch 12, shown in FIG. 3, consists of a MOS transistor switch 21, a DC power supply 22, and a variable resistor 23. When the training input signal 15 is applied to the gate 24 of the MOS transistor 21, the MOS transistor 21 turns on and a voltage is applied to the electrolyte element 11. When the input signal 15 is absent, the MOS transistor 21 is off and no voltage is applied to the electrolyte element 11. The magnitude of the voltage applied to the electrolyte element 11 is set to a predetermined value by the variable resistor 23.
An example of the power switch 13, shown in FIG. 4, consists of a MOS transistor switch 25, a DC power supply 26, and a variable resistor 27. Here, the DC power supply 26 is arranged to supply a voltage of opposite polarity to that of the DC power supply 22. When the training input signal 16 is applied to the gate 28 of the MOS transistor 25, the MOS transistor 25 turns on and a voltage opposite in direction to that from the power switch 12 is applied to the electrolyte element 11. When the input signal 16 is absent, the MOS transistor 25 is off and no voltage is applied to the electrolyte element 11. As with the power switch 12, the magnitude of the voltage applied to the electrolyte element 11 is set to a predetermined value by the variable resistor 27.
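The switching behavior of power switches 12 and 13 can be summarized as a small truth table. The following Python sketch is illustrative only. The function name `applied_voltage`, the default voltage magnitudes (standing in for the values set by variable resistors 23 and 27), and the decision to reject simultaneous input signals are assumptions, since the description does not say what happens when both signals are present.

```python
def applied_voltage(signal_15: bool, signal_16: bool,
                    v_first: float = 1.0, v_second: float = 1.0) -> float:
    """Voltage applied to the electrolyte element for a given pair of input signals.

    signal_15 turns on power switch 12 (positive voltage, magnitude v_first).
    signal_16 turns on power switch 13 (opposite polarity, magnitude v_second).
    With neither signal present both transistors are off and no voltage is applied.
    """
    if signal_15 and signal_16:
        # Assumption: the two switches are never driven at the same time.
        raise ValueError("input signals 15 and 16 must not be active together")
    if signal_15:
        return +v_first
    if signal_16:
        return -v_second
    return 0.0


if __name__ == "__main__":
    for s15, s16 in [(False, False), (True, False), (False, True)]:
        print(s15, s16, applied_voltage(s15, s16))
```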

<Structure of electrolyte element>
For ease of understanding of the configuration and function, Embodiment 1 describes the case of an electrolyte element 11 with two electrodes (two-terminal electrolyte element 11).
The structure of the electrolyte element 11 is shown in the cross-sectional view of FIG. 5. The electrolyte element 11 has a laminated structure in which an electrolyte material layer 3, within which anions 1 and cations 2 can move, is sandwiched between a first electrode 4 and a second electrode 5. The effect of applying a current can be measured as the voltage (electromotive force) between the first electrode 4 and the second electrode 5.

Because FIG. 5 and the subsequent diagrams illustrate the present invention conceptually, the actual structure need not be geometrically similar to the structures shown in these figures, and elements not shown in the figures may be added or replaced with equivalent elements.

As the material of the electrolyte material layer 3, for example, the liquid electrolyte tetramethylammonium tetrafluoroborate (TMA-BF₄) can be used. A liquid electrolyte containing at least one selected from the group consisting of tetramethylammonium ion (TMA⁺), tetraethylammonium ion (TEA⁺), tetrabutylammonium ion (TBA⁺), tetrafluoroborate ion (BF₄⁻), N,N-diethyl-N-methyl-N-(2-methoxyethyl)ammonium bis(trifluoromethanesulfonyl)imide (DEME-TFSI), and N,N-diethyl-N-methyl-N-(2-methoxyethyl)ammonium tetrafluoroborate (DEME-BF₄) can also be used. Various additives may be added to the electrolyte material in addition to the electrolyte. Solid electrolytes, polymer compounds containing mobile ions, metal oxides having mobile ions, and silicic acid (SiO₂) can also be used as the electrolyte material.
Examples of polymer compounds containing mobile ions include polyethylene oxide and Nafion, and examples of metal oxides having mobile ions include cerium oxide (CeO₂), tantalum oxide (Ta₂O₅), zirconium oxide (ZrO₂), niobium oxide (Nb₂O₅), tungsten oxide (WO₃), and lithium oxide (Li₂O).

As the material of the first electrode 4 and the second electrode 5, for example, graphite, which is relatively inert to chemical reaction with the electrolyte, can be used. Besides graphite, metals having electron conductivity, such as gold, platinum, silver, palladium, aluminum, iron, copper, tungsten, titanium, and tantalum, can be used. Semiconductors having electron conductivity, such as carbon, silicon, and lithium cobalt oxide, can also be used for the first electrode 4 and the second electrode 5. These metals and semiconductors contain active materials capable of chemically reacting with ions under an electric field.

<Operation of decision-making device>
The operation of the decision-making device of the present invention, which is capable of dynamic reinforcement learning, is described with reference to FIGS. 6 and 7. FIG. 6 shows that the voltage (electromotive force) between the first electrode 4 and the second electrode 5 can be changed by passing a current from the second electrode side through the two-terminal electrolyte element 11 shown in FIG. 5.

When the electrolyte element 11 shown in FIG. 6 has just been fabricated (the origin state), anions 1 and cations 2 are uniformly distributed in the electrolyte material layer 3, as shown in FIG. 5. Next, when a current is passed from the first electrode 4 side of the electrolyte element 11, the negatively charged anions 1 in the electrolyte material layer 3 move to the vicinity of the interface between the first electrode 4 and the electrolyte material layer 3 (hereinafter the first-electrode-side interface; the interface between the second electrode 5 and the electrolyte material layer 3 is likewise called the second-electrode-side interface), in some cases partly penetrating into the first electrode, and become concentrated there. This concentration of anions 1 causes positive charges h⁺ (conduction carriers of positive polarity) to accumulate in the first electrode 4. Meanwhile, at the second electrode 5 facing the first electrode, the anions 1 are depleted, leaving behind the positively charged cations 2. As a result, negative charges e⁻ (conduction carriers of negative polarity) accumulate in the second electrode 5.
Because this state resembles a charged parallel-plate capacitor, a voltage with the first electrode 4 at positive polarity arises between the first electrode 4 and the second electrode 5 as an electromotive force (denoted V, with the polarity of the applied voltage defined as that of the first electrode 4 side). This electromotive force V varies with the current passed and with the ionic conductivity and ion transport number in the electrolyte material layer 3. The current is preferably passed for several milliseconds to several seconds.

Because the electromotive force V generated in this device arises from the charge accumulated by the current, it is not lost immediately even when the current is stopped and the circuit is opened. The electromotive force can then be increased or decreased by passing further current.

Next, the procedure for reinforcement learning and decision-making is described with reference to FIG. 7, taking as an example the selection between two actions A and B with reward probabilities P_A and P_B (%). When the decision-making device 100 shows a positive electromotive force (voltage), it selects action A and obtains a reward with probability P_A. Conversely, no reward is obtained with probability 100 − P_A (%). Likewise, when it shows a negative electromotive force (voltage), it selects action B and obtains a reward with probability P_B. In this case no reward is obtained with probability 100 − P_B (%).
If the electromotive force is judged to be positive at time t_1 in FIG. 7, action A is selected and a reward is obtained with probability P_A. In the device, a positive current of a predetermined value corresponding to this reward is passed for a fixed time. The positive current increases the electromotive force V in the positive direction (V_1). When the current is stopped and the circuit is opened for a fixed time, the electromotive force V decays (V_2). With the circuit open, the electromotive force V is judged at time t_2, after which a current is passed again from time t_2 (in this case in the direction opposite to the above), and the electromotive force V is judged at time t_3. The circuit is then opened at time t_3 (V_4), and the same process is repeated. The process from time t_1 to t_2 in FIG. 7 (T_1 in FIG. 7) counts as one trial, and this trial is repeated. As the number of trials increases, the electromotive force becomes biased toward positive or negative, and from this bias the device is judged to have selected action A or action B. For example, if P_A > P_B and the electromotive force becomes biased toward positive, the decision-making device 100 is interpreted as having correctly selected the action with the higher reward probability.
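The trial procedure can be mimicked numerically by treating the electromotive force as a quantity that is stepped by each current pulse and then decays while the circuit is open. This is a minimal sketch under assumptions that are not in the patent: the function name `run_trials`, a fixed voltage step per pulse (`pulse`), a fixed open-circuit retention factor (`retention`), and a penalty weight `omega` taken from the tug-of-war rule in the background section. It does not model the electrochemistry of the electrolyte element.

```python
import random
from typing import List


def run_trials(p_a: float, p_b: float, trials: int = 500,
               pulse: float = 1.0, omega: float = 1.0,
               retention: float = 0.95, seed: int = 0) -> List[float]:
    """Return the electromotive force read at the end of each trial.

    One trial = read the sign of V, choose A (V > 0) or B (V < 0),
    apply a current pulse whose sign depends on the outcome,
    then let V decay while the circuit is open.
    """
    rng = random.Random(seed)
    v = 0.0
    history = []
    for _ in range(trials):
        # Positive EMF selects action A; zero is resolved by a coin flip.
        choose_a = v > 0 or (v == 0 and rng.random() < 0.5)
        rewarded = rng.random() < (p_a if choose_a else p_b)
        if choose_a:
            dv = pulse if rewarded else -omega * pulse
        else:
            dv = -pulse if rewarded else omega * pulse
        v += dv          # closed circuit: current pulse changes the EMF
        v *= retention   # open circuit: EMF decays before the next reading
        history.append(v)
    return history


if __name__ == "__main__":
    emf = run_trials(0.8, 0.2)
    print("final EMF:", emf[-1])  # positive means action A is selected
```

In this toy model the electromotive force stays bounded because of the decay term, and its sign settles on the side of the action with the higher reward probability.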

In the decision-making device 100 of the present invention, charge is accumulated in the electrolyte element 11 according to the actions taken for an event, and after repeated trials the decision is made from the electromotive force produced by the charge finally accumulated. A key point of the present invention is the use of an electrolyte, which operates electrochemically, as this charge storage element.

Consider, for example, replacing the electrolyte element 11 with a capacitor that stores electrons as the charge storage element. For a capacitor, if the charge accumulated by applying current is Q, the charge that must be lost to respond to a change in reward probability is likewise −Q. For a capacitor this relationship holds strictly. To illustrate the decision-making process with pachinko: a player who has won 100,000 yen on one machine cannot give up on that machine until losing at least 100,000 yen on it, which can hardly be called wise decision-making.

In the present invention, by contrast, which uses an electrolyte element as the charge storage element, charge is gradually lost as the electrochemical reaction proceeds, so the charge Q that must be lost to respond to a change in reward probability is considerably smaller. In the pachinko example, the player can give up on a machine that has paid out 100,000 yen after losing, say, 30,000 yen and choose another machine, which is a wiser decision.
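The contrast between a capacitor and an element that gradually loses charge can be made concrete with a toy comparison. The sketch below is an assumption-laden illustration, not the patent's model: the function name `trials_to_adapt` is hypothetical, the gradual charge loss is represented by a single retention factor per trial (1.0 for an ideal capacitor, below 1.0 for a leaky element), and the reward probabilities are swapped from (80%, 20%) to (20%, 80%) after a training phase.

```python
import random
from typing import Optional


def trials_to_adapt(retention: float, pre: int = 500,
                    max_post: int = 5000, seed: int = 0) -> Optional[int]:
    """Count trials needed after the swap before the stored value turns negative.

    retention = 1.0 models an ideal capacitor (no charge loss);
    retention < 1.0 models a store that loses part of its charge each trial.
    Returns None if the sign never flips within max_post trials.
    """
    rng = random.Random(seed)

    def step(v: float, p_a: float, p_b: float) -> float:
        choose_a = v > 0 or (v == 0 and rng.random() < 0.5)
        rewarded = rng.random() < (p_a if choose_a else p_b)
        dv = 1.0 if rewarded else -1.0   # omega = 1 because P_A + P_B = 1
        if not choose_a:
            dv = -dv                     # action B pulls the opposite way
        return retention * (v + dv)

    v = 0.0
    for _ in range(pre):                 # training phase: A is the better action
        v = step(v, 0.8, 0.2)
    for n in range(1, max_post + 1):     # after the swap: B is the better action
        v = step(v, 0.2, 0.8)
        if v < 0:
            return n
    return None


if __name__ == "__main__":
    print("ideal capacitor:", trials_to_adapt(1.0))
    print("leaky element:  ", trials_to_adapt(0.9))
```

Under these assumptions the ideal store must first cancel everything it accumulated during training, while the leaky store has retained only a small bounded amount and switches sides after far fewer trials, which matches the pachinko analogy.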

(Embodiment 2)
The series of reinforcement learning and decision-making steps can also be carried out for two or more actions by adding corresponding electrodes as needed. Specifically, the criterion applied to the electromotive force described above is changed to selecting the action that shows the highest or the lowest electromotive force. In principle, therefore, there is no limit to the number of actions that can be handled.

This is described in detail below with reference to the drawings.
FIG. 8 shows an example of an electrolyte element 51 in which the number of electrodes is increased to three: a first electrode 4, a second electrode 5, and a third electrode 6. Consider the case where the first electrode 4, the second electrode 5, and the third electrode 6 correspond to actions A, B, and C, respectively. As in the case of the two-terminal electrolyte element 1 described in Embodiment 1, a current corresponding to the reward probability P_A is passed between the first electrode 4, the second electrode 5, and the third electrode 6, as shown in FIG. 9. Repeating such trials makes it possible to select the action with the highest reward probability.

FIG. 10 shows an example of a decision-making device 110 that uses a three-terminal electrolyte element 31 with three electrodes. In the decision-making device 110 of FIG. 10, the first electrode 33 faces the second electrode 34 and the third electrode 35 across the electrolyte material layer 32, but its function is the same as that of the three-terminal electrolyte element 51, in which the electrodes are arranged side by side. Using the power switches 41, 42, 44, 45, 47, and 48 and the voltmeters 43, 46, and 49, the decision-making device 110 can select the action with the high reward probability among actions A, B, and C by the same method as in Embodiment 1.
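The selection rule of Embodiment 2 (choose the action whose electrode shows the highest, or the lowest, electromotive force) can be sketched in Python as follows. The function name `select_action`, the dictionary layout, and the voltage values are hypothetical and serve only to illustrate the rule; they are not part of the patent.

```python
def select_action(emf_by_action, pick_highest=True):
    """Return the action whose measured electromotive force is the extreme value.

    emf_by_action maps an action label to its measured EMF in volts.
    pick_highest=False applies the "lowest electromotive force" variant.
    """
    chooser = max if pick_highest else min
    return chooser(emf_by_action, key=emf_by_action.get)


# Hypothetical voltmeter readings for actions A, B, and C.
readings = {"A": 0.12, "B": -0.05, "C": 0.03}
chosen = select_action(readings)
lowest = select_action(readings, pick_highest=False)
```

Because the rule is a plain maximum or minimum over the electrodes, it extends to any number of actions without modification, consistent with the statement that the number of actions is unlimited in principle.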

(Embodiment 3)
With this technique, a trial using a single electrolyte element can determine only the action with the highest reward probability, whereas increasing the number of elements makes it possible to solve harder problems. FIG. 11 shows a learning storage device unit 120 in which electrolyte elements 7 and 8, each fitted with a plurality of electrodes, are connected to a power supply (DC power supply) 60 via a wiring switch 9. Here, the learning storage device unit 120 is a part of the decision-making device and is a module consisting of the learning means and the charge supply means.

When the electrode of the electrolyte element 8 showing the highest potential is the first electrode 61, a current corresponding to the reward probability P_A is passed through the first electrode 61. At this time, as shown in FIG. 12, the first electrode 64 of the electrolyte element 8 and the first electrode 61 of the electrolyte element 7 are electrically connected, and an electrode of the electrolyte element 7 other than the first electrode 61, for example the third electrode 63, and the corresponding third electrode 66 of the electrolyte element 8 are electrically connected to the power supply (DC power supply) 60 so that a current flows. In this case, charges of opposite sign accumulate on the third electrode 63 of the electrolyte element 7 and on the corresponding third electrode 66 of the electrolyte element 8.

Next, when the second electrode 62 is selected as the electrode of the electrolyte element 7 other than the first electrode 61, charges of opposite sign likewise accumulate on the second electrode 62 of the electrolyte element 7 and on the second electrode 65 of the electrolyte element 8, as shown in FIG. 13.

By repeating such trials alternately with the first electrolyte element 7 and the second electrolyte element 8, the electrolyte element 7 and the electrolyte element 8 eventually select different actions, and these correspond to the two actions with the highest reward probabilities.
Thus, whereas a trial using a single electrolyte element can determine only the action with the highest reward probability, increasing the number of electrolyte elements and using a wiring switch (wiring switching means) to connect and disconnect the electrodes of the electrolyte elements to and from one another and to and from the power supply as needed makes it possible to solve the harder problem of determining the top two or more actions.

The present invention is described in more detail below by way of Examples. Note, however, that the present invention is not limited to the following Examples and is defined solely by the claims.

(Example 1)
In Example 1, decision making was evaluated using the decision-making device 100 shown in FIG. 2. The electrolyte element 11 had two electrodes; the predetermined voltage described below was applied between them through the power switches 15 and 16 according to the reward probabilities P_A and P_B, and the change in electromotive force was monitored with the voltmeter 14. Graphite was used for the electrodes 4 and 5 of the electrolyte element 11, and tetramethylammonium tetrafluoroborate (TMA-BF₄), a liquid electrolyte, was used as the electrolyte of the electrolyte material layer 3 (see FIG. 5).

The reward probabilities of actions A and B were set to P_A = 80% and P_B = 20%, respectively; action A was selected when the electromotive force was positive, and action B when it was negative. For each action, the applied current was 4 mA when a reward was obtained (with probability P_A or P_B) and 3.9 mA when it was not. The current application time and the open-circuit time were each 1 second. FIG. 14 shows an example of the change in electromotive force that arose between the two electrodes in trials run under these conditions. A change in electromotive force similar to that described with reference to FIG. 7 was actually observed, with a magnitude of several hundred mV. This arises because the electric double layer near the electrode interface is modulated by the applied current.

When the optimal action must be selected by scene-dependent reinforcement learning from a set of actions whose reward probabilities change over time, the ability to follow changes in P_A and P_B is important. In this measurement, therefore, the values of P_A and P_B were swapped every 100 trials. FIG. 15 plots the probability that the device correctly selected the action with the higher reward probability (the correct-answer probability) against the number of trials. The correct-answer probability is 40% or less for trials 0 to 10, but it reaches roughly 90% or more by trial 40. When the values of P_A and P_B were then swapped once the number of trials exceeded 100, the correct-answer probability dropped to 0% immediately afterward. However, the correct-answer probability rose again in response to the change in reward probability and again reached nearly 90% at trial 150. When the reward probabilities were changed further, the same rapid recovery of the correct-answer probability was observed.
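The measurement protocol above can be mimicked with a small Python simulation. This is a sketch under stated assumptions, not the patent's device model. The charge steps of 4.0 and 3.9 (4 mA and 3.9 mA applied for 1 second, in mC) come from Example 1. The sign convention (a reward for A or a miss for B pushes the state positive, the reverse pushes it negative), the per-trial leak fraction of 0.1, the rule that the sign of the stored charge picks the next action, and the names `run` and `correct_rate` are all assumptions introduced for illustration.

```python
import random


def run(p_first=0.8, p_second=0.2, trials=300, swap_every=100,
        leak=0.1, q_reward=4.0, q_no_reward=3.9, seed=0):
    """Simulate one run; return per-trial flags for 'better action was picked'."""
    rng = random.Random(seed)
    p_a, p_b = p_first, p_second
    q = 0.0                                   # stored charge, stand-in for the EMF
    correct = []
    for t in range(trials):
        if t > 0 and t % swap_every == 0:     # swap reward probabilities
            p_a, p_b = p_b, p_a
        if q > 0:
            pick_a = True                     # positive state selects action A
        elif q < 0:
            pick_a = False                    # negative state selects action B
        else:
            pick_a = rng.random() < 0.5       # no preference yet: pick at random
        correct.append(pick_a == (p_a > p_b))
        rewarded = rng.random() < (p_a if pick_a else p_b)
        delta = q_reward if rewarded else -q_no_reward
        if not pick_a:
            delta = -delta                    # action B pulls in the opposite direction
        q = q * (1.0 - leak) + delta          # assumed gradual charge loss per trial
    return correct


def correct_rate(runs=200, **kwargs):
    """Average the per-trial flags over independent runs (seeds 0..runs-1)."""
    results = [run(seed=s, **kwargs) for s in range(runs)]
    return [sum(col) / runs for col in zip(*results)]


rate = correct_rate()
```

Plotting `rate` against the trial index gives a curve of the same qualitative shape as described for FIG. 15: a high plateau, a sharp drop right after each swap, and a recovery. The sketch is not calibrated to the measured curves, so the number of trials needed to recover should not be compared with the figures.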

FIG. 16 shows the change in the correct-answer probability when P_A and P_B were set to 70% and 30% and swapped every 100 trials, as in FIG. 15. The number of trials needed for the correct-answer probability to converge to 90% or more increased, to between 50 and 70. This is consistent with the fact that the values of P_A and P_B are closer than in the trial of FIG. 15, which makes this a harder problem requiring more trials to reach a decision, so the result is reasonable.

(Example 2)
Example 2 shows the same measurement as in Example 1, with only the electrolyte of the device used in Example 1 changed from the liquid electrolyte to Nafion, a solid electrolyte. FIG. 17 shows the result of a measurement in which the reward probabilities P_A and P_B were set to 60% and 40% and swapped every 200 trials; the figure confirms a change in the correct-answer probability similar to that in Example 1. This shows that reinforcement learning, and the decision making that accompanies it, is made possible by ionic conduction regardless of whether the electrolyte is a liquid or a solid. In this example, the function is provided by protons conducted through the Nafion.

The tug-of-war principle is regarded as a form of reinforcement learning that strongly reflects the learning results. The decision-making device of the present invention is a small, simple device that can make decisions efficiently on the basis of such reinforcement learning without requiring complex computation. The decision-making device of the present invention therefore has the potential to be widely used in industry.

1: Anion
2: Cation
3: Electrolyte material layer
4: First electrode
5: Second electrode
6: Third electrode
7: Electrolyte element
8: Electrolyte element
9: Wiring switch
11: Electrolyte element
12: First power switch
13: Second power switch
14: Voltmeter
15, 16: Input signal
21, 25: MOS transistor
22, 26: DC power supply
23, 27: Variable resistor
24, 28: Gate
31: Electrolyte element
32: Electrolyte material layer
33: First electrode
34: Second electrode
35: Third electrode
41, 42, 44, 45, 47, 48: Power switch
43, 46, 49: Voltmeter
51: Electrolyte element
60: Power supply (DC power supply)
100, 110: Decision-making device
120: Learning storage device unit

Claims

A decision-making device comprising: learning means that learns through the accumulation of charge; charge supply means that supplies the learning means with a charge corresponding to an action selected from among actions having different reward probabilities; and voltage reading means that reads the voltage of the learning means, the decision-making device determining, based on the voltage read by the voltage reading means, the action to be selected from among the actions having different reward probabilities,
wherein the learning means comprises an electrolyte element in which an electrolyte material layer capable of transporting ions under an electric field is sandwiched between two or more electrodes made of the same material.

The decision-making device according to claim 1, wherein a current due to the inflow of the charge is passed between the two or more electrodes to transport the ions and generate a voltage between the electrodes.

The decision-making device according to claim 1 or 2, wherein movement of the ions toward at least one of the two or more electrodes, or penetration of the ions into that electrode, generates electrons and holes in the two or more electrodes and thereby produces a voltage.

The decision-making device according to any one of claims 1 to 3, wherein the electrolyte material layer contains a liquid electrolyte or a solid electrolyte.

The decision-making device according to claim 4, wherein the liquid electrolyte contains at least one selected from the group consisting of tetramethylammonium ion (TMA⁺), tetraethylammonium ion (TEA⁺), tetrabutylammonium ion (TBA⁺), tetrafluoroborate ion (BF₄⁻), N,N-diethyl-N-methyl-N-(2-methoxyethyl)ammonium bis(trifluoromethanesulfonyl)imide (DEME-TFSI), and N,N-diethyl-N-methyl-N-(2-methoxyethyl)ammonium tetrafluoroborate (DEME-BF₄).

The decision-making device according to claim 4, wherein the electrolyte material contains a polymer compound having mobile ions.

The decision-making device according to claim 6, wherein the polymer compound contains at least one of polyethylene oxide and Nafion.

The decision-making device according to claim 4, wherein the electrolyte material layer contains at least one of a metal oxide having mobile ions and silicic acid (SiO₂).

The decision-making device according to claim 8, wherein the metal oxide contains at least one selected from the group consisting of cerium oxide (CeO₂), tantalum oxide (Ta₂O₅), zirconium oxide (ZrO₂), niobium oxide (Nb₂O₅), tungsten oxide (WO₃), and lithium oxide (Li₂O).

The decision-making device according to claim 4, wherein the two or more electrodes contain at least one of a metal and a semiconductor having electronic conductivity.

The decision-making device according to claim 10, wherein the metal contains at least one selected from the group consisting of gold, platinum, silver, palladium, aluminum, iron, copper, tungsten, titanium, and tantalum.

The decision-making device according to claim 10, wherein the semiconductor contains at least one selected from the group consisting of carbon, silicon, and lithium cobalt oxide.

The decision-making device according to claim 10, wherein the metal and the semiconductor contain an active substance capable of chemically reacting with ions under an electric field.

The decision-making device according to claim 10, wherein the metal and the semiconductor contain an electrolyte capable of transporting ions under an electric field, and ions in the electrolyte material layer and in one of the two or more electrodes move so that the ions penetrate into the other electrode.

The decision-making device according to any one of claims 1 to 14, wherein the decision-making device has wiring switching means.