JP2020082314A - Learning device, robot control method, and robot control system

Info

Publication number: JP2020082314A
Application number: JP2018224020A
Authority: JP
Inventors: Tomoki Yamagishi (山岸 友樹)
Original assignee: Kyocera Document Solutions Inc
Current assignee: Kyocera Document Solutions Inc
Priority date: 2018-11-29
Filing date: 2018-11-29
Publication date: 2020-06-04
Anticipated expiration: 2038-11-29
Also published as: JP7247552B2

Abstract

To enable efficient actions to be planned for a robot arm without increasing the user's workload.
SOLUTION: A learning device includes a learning unit 213 that learns the actions of a robot arm 11 by associating with one another the state of the robot arm 11 at a given time as observed by a state observation unit 212, the action of the robot arm 11 from that state, and the state of the robot arm 11 after the action, during the course of the robot arm 11 moving to a target arrival position. The learning unit 213 includes a reward calculation unit 213A that calculates a reward according to the trajectory along which the robot arm 11 has moved and the driving of an arm drive unit 16, and a function update unit 213B that updates, based on the reward calculated by the reward calculation unit 213A, an action value function indicating the value of selecting a given action from a given state of the robot arm 11.
SELECTED DRAWING: Figure 2

Description

The present invention relates to a learning device that learns the behavior of a robot, a robot control device, and a robot control system.

A robot control device that controls the driving of the joints of a robot arm typically generates a target trajectory for the arm from the current position of the arm's tip and a target arrival position (for example, the location of the workpiece the arm is to handle), calculates the rotation angle of each joint at each point in time from that target trajectory, and controls the drive motor of each joint according to the result.

Patent Document 1: Japanese Patent No. 6240689
Patent Document 2: Japanese Patent No. 5528214

A computer can generate the target trajectory of the robot arm by calculation from the current position of the arm's tip and the target arrival position, but the trajectory generated in this way is not always efficient.

A robot acts exactly as its human-written program dictates. However, programming the robot to follow the same work procedure a human would use can be inefficient, because the robot is subject to constraints such as the range of motion of its arm.

Even when an efficient program has been written, any change, such as a different target arrival position, requires reprogramming, which increases the user's workload. Reprogramming is likewise required when the robot model changes and the machine specifications change with it.

One conceivable way to plan efficient actions for the robot arm on its way to the target arrival position (for example, actions with a short travel distance or low power consumption) is to have the robot's motions learned.

Patent Document 1 addresses the problem that, when a robot and a person work together, the large number of human behavior patterns makes it difficult to set the robot control method best suited to them. It describes setting an optimal control method by a reinforcement learning method that updates the action value of the robot at predetermined movement points, based on the robot's action time and the burden on the person (the robot's acceleration).

Patent Document 2 addresses the problem that reinforcement learning as currently used for controlling robots and the like cannot learn taught content efficiently, by trial and error, in a way adapted to the learner's own situation. It describes managing event lists, each a set of a series of states and actions, in a database and searching for action values efficiently.

Both Patent Documents 1 and 2 describe learning the motions of a robot, but neither describes planning efficient actions for a robot arm on its way to a target arrival position.

The present invention has been made in view of the above circumstances, and its object is to make it possible to plan efficient actions for a robot arm without increasing the user's workload.

A learning device according to one aspect of the present invention learns the behavior of a robot that includes a robot arm having a plurality of joints and freely movable in three-dimensional space, a joint drive unit provided at each of the joints to drive that joint, and a state detection unit that detects the state of the robot arm, including its position. The learning device includes a state observation unit that observes the state of the robot arm based on the detection result of the state detection unit, and a learning unit that learns the behavior of the robot arm by associating with one another the state of the robot arm at a given time as observed by the state observation unit, the action of the robot arm from that state, and the state of the robot arm after the action, during the course of the robot arm moving to a preset target arrival position. The learning unit further includes a reward calculation unit that calculates a reward according to the trajectory along which the robot arm has moved and the driving of the joint drive units, and a function update unit that updates, based on the reward calculated by the reward calculation unit, an action value function indicating the value of selecting a given action from a given state of the robot arm.

A robot control device according to one aspect of the present invention includes the above learning device and a decision-making unit that selects an action for the robot arm to perform based on the learning result of the learning device, and controls the behavior of the robot arm based on the decision made by the decision-making unit.

A robot control system according to one aspect of the present invention includes the above robot control device and the robot.

According to the present invention, a high-value action can be selected automatically from a given state of the robot arm on the basis of actual learning results, so efficient actions that bring the tip of the robot arm to the target arrival position can be planned without increasing the user's workload.

FIG. 1 is a functional block diagram schematically showing the main internal configuration of a robot control system that includes a robot control device having a learning device according to a first embodiment of the present invention.
FIG. 2 is an explanatory diagram of the flow of data and the like between the components of the robot control system.
FIG. 3 is an external view schematically showing the robot to be controlled.
FIG. 4 is a flowchart showing an example of the processing performed by the control unit of the robot control device.
FIG. 5 is a functional block diagram schematically showing the main internal configuration of a robot control system that includes a robot control device having a learning device according to a second embodiment.

A learning device, a robot control device, and a robot control system according to an embodiment of the present invention are described below with reference to the drawings. FIG. 1 is a functional block diagram schematically showing the main internal configuration of a robot control system that includes a robot control device having a learning device according to the first embodiment. FIG. 2 is an explanatory diagram of the flow of data and the like between the components of the robot control system. FIG. 3 is an external view schematically showing the robot to be controlled.

The robot control system 1 includes a robot 10 and a robot control device 20 that controls the operation of the robot 10.

As shown in FIG. 3, the robot 10 is a manipulator with motor functions similar to those of a human arm. It includes a robot arm 11 that can move freely in three-dimensional space, and the base of the robot arm 11 is fixed to a pedestal 14. The robot arm 11 has a plurality of joints 12A to 12C (hereinafter also referred to collectively as "joints 12") and links 13A and 13B that connect the joints 12 to one another.

The robot arm 11 is configured so that an end effector can be attached to and removed from its tip 15 and exchanged. In FIG. 3, a gripper 41 with two parallel claws 41A and 41B is attached as the end effector. The gripper 41 is used, for example, to grip a workpiece 42 placed in a box 43 and carry it to another location. The gripper 41 has a built-in gripper drive unit 41C (for example, a cylinder) that drives the claws 41A and 41B pneumatically.

The robot 10 includes: arm drive units 16A to 16C (hereinafter also referred to collectively as "arm drive units 16"), one provided at each joint 12, that drive the joints 12 (that is, the robot arm 11); rotation angle detection units 17A to 17C (hereinafter "rotation angle detection units 17"), one provided at each joint 12, that detect the rotation angle of the joint 12; torque sensors 18A to 18C (hereinafter "torque sensors 18") that detect the torque of each arm drive unit 16; and a work environment detection unit 19 that is installed above the robot 10 and detects the work environment of the robot 10. Examples of the arm drive units 16, the rotation angle detection units 17, and the work environment detection unit 19 are a motor, an encoder, and a camera, respectively.

The arm drive unit 16 is an example of the joint drive unit recited in the claims, and the rotation angle detection unit 17 and the torque sensor 18 are examples of the state detection unit recited in the claims. The position of the tip 15 of the robot arm 11 can be calculated from the angles of the joints 12A to 12C. The work environment detection unit 19 can also serve as the state detection unit.
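
The statement that the tip position follows from the joint angles can be illustrated with a small forward-kinematics sketch. The source does not give the arm geometry, so everything here is an assumption made for illustration: joint 12A is taken to rotate about the vertical axis, joints 12B and 12C are taken to pitch links 13A and 13B in a vertical plane, and the link lengths are placeholder values.

```python
import math


def tip_position(q_a, q_b, q_c, len_13a=0.4, len_13b=0.3):
    """Return (x, y, z) of the arm tip for joint angles given in radians.

    Hypothetical geometry (not from the source): joint 12A yaws about the
    vertical axis, joints 12B and 12C pitch links 13A and 13B.
    """
    # Horizontal reach and height within the arm's vertical plane.
    reach = len_13a * math.cos(q_b) + len_13b * math.cos(q_b + q_c)
    height = len_13a * math.sin(q_b) + len_13b * math.sin(q_b + q_c)
    # Rotate that plane about the vertical axis by the base angle.
    return (reach * math.cos(q_a), reach * math.sin(q_a), height)
```

With all angles at zero the arm is stretched along the x axis, so the tip sits at the sum of the two link lengths.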

The robot control device 20 includes a control unit 21, an operation unit 22, a display unit 23, a storage unit 24, an external interface unit (external I/F) 25, and a communication interface unit (communication I/F) 26.

The operation unit 22 consists of a keyboard, a mouse, and the like, and is used to enter commands and characters into the control unit 21 and to operate the on-screen pointer of the display unit 23. The display unit 23 displays responses and data results from the control unit 21. The operation unit 22 is used, for example, to enter the target arrival position of the tip 15 of the robot arm 11. The target arrival position can also be set, without a user instruction, from the position of the workpiece 42 read from an image captured by the work environment detection unit 19.

The storage unit 24 is a storage device such as an HDD (hard disk drive). It stores the programs and data required for the operation of the robot control device 20 and includes a reward table storage unit 241 that stores the reward table described later.

The external interface unit 25 connects the robot control device 20 to external devices. Through the external interface unit 25, the robot control device 20 is connected to the arm drive units 16, the rotation angle detection units 17, the torque sensors 18, the work environment detection unit 19, and the gripper drive unit 41C of the robot 10.

The communication interface unit 26 is an interface with a communication module such as a LAN (local area network) chip (not shown) and communicates with an external device 30. Through the communication interface unit 26, the robot control device 20 exchanges data with, for example, other robot control devices.

The control unit 21 includes a processor, RAM (random access memory), ROM (read-only memory), and dedicated hardware circuits. The processor is, for example, a CPU (central processing unit), an ASIC (application-specific integrated circuit), or an MPU (micro processing unit). The control unit 21 includes a control unit 211, a state observation unit 212, a learning unit 213, a decision-making unit 214, an action planning unit 215, an arm instruction unit 216, and a gripper instruction unit 217. The learning device according to the present invention consists of the state observation unit 212 and the learning unit 213.

The control unit 21 functions as the control unit 211, the state observation unit 212, the learning unit 213, the decision-making unit 214, the action planning unit 215, the arm instruction unit 216, and the gripper instruction unit 217 when the processor operates according to the control program stored in the storage unit 24. Alternatively, each of these components can be implemented as a hardware circuit rather than through operation of the control unit 21 under the control program. Unless otherwise noted, the same applies to every embodiment below.

The control unit 211 governs the overall operation of the robot control device 20. It is connected to the operation unit 22, the display unit 23, the storage unit 24, the external interface unit 25, and the communication interface unit 26; it controls the operation of each of these connected components and exchanges signals and data with them.

The state observation unit 212 observes the state of the robot arm 11 based on the detection results of the rotation angle detection units 17, the torque sensors 18, and the work environment detection unit 19, which serve as the state detection unit. As shown in FIG. 2, it includes a physical quantity processing unit 212A and an image processing unit 212B.

The physical quantity processing unit 212A processes the physical quantities that indicate the rotation angle of each joint 12, as detected by the rotation angle detection units 17, to calculate the magnitude of each joint's rotation angle. It also processes the physical quantities that indicate the torque of each arm drive unit 16 (drive motor), as detected by the torque sensors 18, to calculate the torque of each arm drive unit 16, and it outputs the calculated results to the learning unit 213. The image processing unit 212B processes the images captured by the work environment detection unit 19, extracts the information needed for the actions of the robot arm 11, and outputs the extracted results to the learning unit 213. The torque of each arm drive unit 16 (drive motor) may also be obtained by conversion from the motor current instead of from the torque sensor 18.
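
The source does not say how motor current would be converted to torque. One common simplification, used here purely as an assumption, is a linear torque constant per motor (torque = K_t × current). The function name, the unit choices, and the use of magnitudes are all hypothetical.

```python
def total_torque_from_currents(currents_a, torque_constants_nm_per_a):
    """Estimate the summed torque magnitude of all drive motors.

    Assumes a linear motor model (torque = K_t * current), which the
    source does not specify; real motors may need a richer model.
    """
    if len(currents_a) != len(torque_constants_nm_per_a):
        raise ValueError("one torque constant is required per motor current")
    # Sum magnitudes so that motors turning in opposite directions do not cancel.
    return sum(abs(k * i) for i, k in zip(currents_a, torque_constants_nm_per_a))
```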

The learning unit 213 learns the behavior of the robot arm 11 by associating with one another the state of the robot arm 11 at a given time as observed by the state observation unit 212, the action of the robot arm 11 from that state, and the state of the robot arm 11 after the action, during the course of the robot arm 11 moving to the target arrival position.

The learning unit 213 can use a known machine learning algorithm, such as reinforcement learning, as its learning algorithm. As shown in FIG. 2, the learning unit 213 includes a reward calculation unit 213A and a function update unit 213B.

One example of a reinforcement learning algorithm is Q-learning. Q-learning learns a function Q(s, a) that indicates the value of selecting action a in a given state s. In state s, the action a with the highest Q value is the optimal action. When learning starts, however, the correlation between state s and action a is unknown, so various actions a are tried in state s by trial and error, and the reward r received each time is used to update the function Q repeatedly, bringing it closer to the optimum.
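
For reference, the textbook form of the Q-learning update is shown below. The source does not write out an update rule, so the learning rate \alpha, the discount factor \gamma, and the next state s' are standard Q-learning symbols introduced here, not notation from the patent.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

Here r is the reward received after taking action a in state s, and the maximum is taken over the actions a' available in the resulting state s'.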

When the environment (that is, the state s) changes as a result of selecting action a in state s, the reward calculation unit 213A calculates the reward r given for that change. It calculates the reward according to the trajectory along which the robot arm 11 has moved and the driving of the arm drive units 16.

Formula 1 below is an example of a formula for giving the reward, expressed as the sum of two or more terms.

reward r = w1 × r1 + w2 × r2   (Formula 1)

Here r1 is the reward for the trajectory along which the robot arm 11 has moved, and r2 is the reward for the driving of the arm drive units 16 (drive motors). w1 and w2 are the weights of the respective terms. The rewards r1 and r2 are values normalized to the range −1 to +1, as given by Formulas 2 to 5 below, and the reward r is the sum of these values multiplied by their weights w1 and w2.

r1 = (−2 × distance d) / k1 + 1   (0 < d ≤ k1)   (Formula 2)

r1 = −1   (k1 < d)   (Formula 3)

r2 = (−2 × torque t) / k2 + 1   (0 < t ≤ k2)   (Formula 4)

r2 = −1   (k2 < t)   (Formula 5)

k1 and k2 are constants. The constant k1 can be, for example, the length of the shortest path, a straight line from the action start position of the robot arm 11 to the target arrival position.

The distance d is the distance between the position of the tip 15 of the robot arm 11 and the target arrival position; the reward r1 increases as d decreases. The trajectory along which the robot arm 11 has moved may be obtained from the rotation angles of the joints 12 detected by the rotation angle detection units 17, or from the images captured by the work environment detection unit 19.

The torque t is the sum of the torques of the arm drive units 16; the reward r2 decreases as t increases. In other words, r2 is a reward calculated according to the drive power of the arm drive units 16.
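
Formulas 1 to 5 can be sketched in a few lines of code. This is a minimal illustration rather than the patented implementation. The function names are hypothetical, and treating d = 0 or t = 0 as the maximum reward of +1 is an assumption, since the formulas only state the case for values greater than zero.

```python
def normalized_term(value, k):
    """Formulas 2 to 5: map a value to [-1, +1], saturating at -1 above k.

    Assumption: value == 0 is treated like the 0 < value <= k branch,
    which gives +1; the source formulas do not cover that case.
    """
    if value > k:
        return -1.0
    return (-2.0 * value) / k + 1.0


def reward(distance, torque_sum, k1, k2, w1=1.0, w2=1.0):
    """Formula 1: weighted sum of the trajectory term and the drive term."""
    r1 = normalized_term(distance, k1)    # closer to the target -> larger r1
    r2 = normalized_term(torque_sum, k2)  # less total torque -> larger r2
    return w1 * r1 + w2 * r2
```

For example, with k1 = 1.0 a remaining distance of 0.5 gives r1 = 0, and any distance beyond k1 gives r1 = −1.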

Although the torque of the arm drive units 16 is used here to explain the reward r2 for the driving of the arm drive units 16, the drive time of the arm drive units 16 may be counted instead, and r2 may be calculated so that it decreases as the drive time increases.

The function update unit 213B updates the function Q so that actions a yielding a higher reward r become more likely to be selected. Based on the reward r calculated by the reward calculation unit 213A, it updates the action value function, which indicates the value of selecting a given action from a given state of the robot arm 11. As the robot arm 11 repeats its actions and the action value function is updated each time, the learning unit 213 learns the optimal action for each state and the action value function converges.

The action value function obtained as the learning result of the learning unit 213 can be held as a reward table (action value table) that stores the value of every state-action pair, and the learning unit 213 saves this reward table in the reward table storage unit 241.

An example of the processing performed by the control unit 21 of the robot control device 20 is described with reference to the flowchart in FIG. 4. This processing is performed while the robot arm 11 is operating.

The state observation unit 212 observes the state of the robot arm 11 based on the detection results of the rotation angle detection units 17, the torque sensors 18, and the work environment detection unit 19 (S1). Based on the state of the robot arm 11 observed by the state observation unit 212, the reward calculation unit 213A calculates the reward according to the trajectory along which the robot arm 11 has moved and the driving of the arm drive units 16 (S2).

Next, the function update unit 213B updates the action value function based on the reward calculated by the reward calculation unit 213A (S3), after which the processing returns to S1. By repeating S1 to S3, the robot control device 20 keeps updating the action value function (reward table).
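
The update in S3 can be sketched as one step of tabular Q-learning. This is a sketch under assumptions: the source does not specify the update rule, so the standard Q-learning form is used, and the learning rate `alpha`, the discount factor `gamma`, the function name, and the dictionary layout keyed by (state, action) are all hypothetical.

```python
from collections import defaultdict


def update_action_value(table, state, action, reward, next_state, actions,
                        alpha=0.1, gamma=0.9):
    """Apply one standard Q-learning update to a (state, action) -> value table.

    alpha and gamma are illustrative hyperparameters, not values from the source.
    """
    # Best value reachable from the state observed after the action.
    best_next = max(table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    # Move the stored value a fraction alpha toward the target.
    table[(state, action)] += alpha * (td_target - table[(state, action)])
    return table[(state, action)]


# A table whose unseen entries default to 0.0 (an assumed initial value).
reward_table = defaultdict(float)
```

Calling this function once per S1 to S3 cycle, with the reward computed in S2, would keep refining the table while the arm operates.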

The decision-making unit 214 selects an action for the robot arm 11 to perform based on the learning result (action value function) of the learning unit 213. For example, the decision-making unit 214 selects the most valuable action from a given state and outputs its selection to the action planning unit 215.
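
Selecting the most valuable action from a given state amounts to a greedy lookup in the reward table. The sketch below assumes the table is a dictionary keyed by (state, action) and that unseen pairs count as 0.0; both are illustrative choices, as is the function name.

```python
def select_action(table, state, actions):
    """Return the action with the highest stored value for this state.

    Unseen (state, action) pairs are assumed to have value 0.0; on a tie
    the first action in `actions` wins.
    """
    return max(actions, key=lambda a: table.get((state, a), 0.0))
```

During learning, a purely greedy choice would never try new actions, so some exploration strategy would normally be combined with it. The source does not describe one.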

The action planning unit 215 generates an action plan for the robot 10 based on the input from the decision-making unit 214 and outputs information describing the plan to the arm instruction unit 216 or the gripper instruction unit 217, depending on its content. For example, the action planning unit 215 generates the trajectory of the tip 15 of the robot arm 11.

The arm instruction unit 216 generates, according to the action plan information received from the action planning unit 215, drive signals that control the operation of the arm drive units 16 driving the joints 12 of the robot arm 11, and thereby controls the driving of the arm drive units 16.

The gripper instruction unit 217 generates, according to the action plan information received from the action planning unit 215, a drive signal that controls the operation of the gripper drive unit 41C, and thereby controls the driving of the gripper drive unit 41C.

According to the above embodiment, a high-value action can be selected automatically from a given state of the robot arm 11 on the basis of actual learning results, so efficient actions that bring the tip 15 of the robot arm 11 to the target arrival position can be planned without increasing the user's workload. Because the behavior of the robot arm 11 is learned, unit-to-unit variation and differences in configuration between models can also be accommodated flexibly, which likewise keeps the user's workload from growing.

Learning efficiency can be improved by sending the reward table stored in the reward table storage unit 241 to another robot control device through the communication interface unit 26 and using it as the initial table in that device. The reward table can also be stored on a network and shared with other robot control devices.
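
One way to hand a reward table to another controller is to serialize it to a text format. The source does not describe a transfer format, so the JSON layout and function names below are hypothetical, and the sketch assumes states and actions are plain strings.

```python
import json


def export_reward_table(table):
    """Serialize a (state, action) -> value table to a JSON string.

    Assumes states and actions are strings; the row layout is illustrative.
    """
    rows = [{"state": s, "action": a, "value": v} for (s, a), v in table.items()]
    return json.dumps(rows)


def import_reward_table(payload):
    """Rebuild the table on the receiving side for use as an initial table."""
    return {(row["state"], row["action"]): row["value"]
            for row in json.loads(payload)}
```

The receiving controller would start learning from the imported values instead of from an empty table.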

FIG. 5 is a functional block diagram schematically showing the main internal configuration of a robot control system that includes a robot control device having a learning device according to the second embodiment. This configuration differs from the one shown in FIG. 1 in that the control unit 21 includes a setting reception unit 218.

The setting reception unit 218 accepts, in response to the user's operation of the operation unit 22, user settings for the weights of the reward for the trajectory along which the robot arm has moved and for the driving of the joint drive units. The control unit 21 additionally functions as the setting reception unit 218 when the processor operates according to the control program stored in the storage unit 24. Alternatively, the setting reception unit 218 can be implemented as a hardware circuit rather than through operation of the control unit 21 under the control program.

The reward calculation unit 213A calculates the reward r using the user-set weights accepted by the setting reception unit 218. For example, if the user wants to prioritize shortening the distance the robot arm 11 travels (that is, shortening the work time) over reducing the power consumed by the arm drive units 16 (drive motors), setting the weight w1 to a large value makes changes in the first term of Formula 1 carry more influence, so the action value function the user wants can be obtained.
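
The effect of the weights can be seen with two made-up candidates. The r1 and r2 values below are invented for illustration only: "direct" stands for a short path that needs high torque, and "gentle" for a longer path that needs little torque.

```python
def weighted_reward(r1, r2, w1, w2):
    """Formula 1 applied to already-normalized terms r1 and r2."""
    return w1 * r1 + w2 * r2


# Hypothetical (r1, r2) pairs; not measurements from the source.
CANDIDATES = {
    "direct": (0.8, -0.4),  # short path, high torque
    "gentle": (0.2, 0.6),   # longer path, low torque
}


def preferred(w1, w2):
    """Return the candidate that scores highest under the given weights."""
    return max(CANDIDATES, key=lambda name: weighted_reward(*CANDIDATES[name], w1, w2))
```

With equal weights the low-torque candidate scores higher (0.8 against 0.4). Raising w1 to 3 flips the preference to the short path (about 2.0 against 1.2).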

Therefore, according to the second embodiment, setting the weights individually makes it possible to obtain an action value function that matches the user's preference and to make the robot arm 11 act accordingly.
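
The weighting described above can be sketched as follows. This is a minimal illustration only: Equation 1 is not reproduced in this section, so the sketch assumes the reward is a weighted sum of a trajectory term (how much closer the tip came to the target) and a drive term (the negative sum of joint torque magnitudes). The weight name w1 comes from the text; `w2`, `RewardWeights`, and `weighted_reward` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RewardWeights:
    """User-set weights (w1 is named in the text; w2 is an assumed name)."""
    w1: float  # weight on the trajectory term
    w2: float  # weight on the drive term


def weighted_reward(prev_distance: float,
                    new_distance: float,
                    joint_torques: Sequence[float],
                    weights: RewardWeights) -> float:
    """Assumed reward form: w1 * trajectory term + w2 * drive term.

    The trajectory term is positive when the tip moved closer to the
    target; the drive term is more negative when total torque is larger.
    """
    trajectory_term = prev_distance - new_distance
    drive_term = -sum(abs(t) for t in joint_torques)
    return weights.w1 * trajectory_term + weights.w2 * drive_term
```

Under this assumed form, raising w1 relative to w2 makes the same reduction in distance count for more than the same torque cost, which corresponds to the user prioritizing a shorter path over lower power consumption.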

Incidentally, as learning by the learning unit 213 continues, a past reward table may turn out to have been more appropriate, or better suited to the user's preference, than the latest action value function (reward table).

Therefore, in another embodiment, the reward table storage unit 241 is configured to store a plurality of reward tables. For example, when the learning unit 213 receives a user instruction via the operation unit 22 to save the reward table, it stores the reward table in use at that time in the reward table storage unit 241; when it receives a user instruction via the operation unit 22 to use a past reward table, it uses that past reward table.
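
The save-and-restore behavior of this embodiment might look like the sketch below. The representation of a reward table as a mapping from (state, action) pairs to values, the use of text labels to identify snapshots, and the names `RewardTableStore`, `save`, and `restore` are all assumptions made for illustration; the text does not specify them.

```python
import copy
from typing import Dict, List, Tuple

# Assumed representation: (state, action) -> value
RewardTable = Dict[Tuple[str, str], float]


class RewardTableStore:
    """Hypothetical stand-in for the reward table storage unit 241."""

    def __init__(self) -> None:
        self._snapshots: Dict[str, RewardTable] = {}

    def save(self, label: str, table: RewardTable) -> None:
        # Deep-copy so later learning does not alter the saved snapshot.
        self._snapshots[label] = copy.deepcopy(table)

    def restore(self, label: str) -> RewardTable:
        # Return a copy so the stored snapshot stays reusable.
        return copy.deepcopy(self._snapshots[label])

    def labels(self) -> List[str]:
        return list(self._snapshots)
```

Copying on both save and restore keeps each stored table independent of the table that continues to be updated by learning.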

If the learning unit 213 has learned sufficiently and the action value function (reward table) has converged, no further learning is necessary; however, when the work environment of the robot 10 changes significantly, re-learning is preferable. For example, a large change in temperature or humidity may change the characteristics of the hardware constituting the robot 10, so that the existing reward table may no longer yield the optimal action of the robot arm 11.

Therefore, in yet another embodiment, a temperature sensor and a humidity sensor are provided as a work environment detection unit 19. The temperature and humidity at the time the robot 10 is powered off are measured and retained, and are compared with the temperature and humidity at the time the robot 10 is powered on; if the difference is equal to or greater than a predetermined threshold, the learning unit 213 performs re-learning.
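
The power-off/power-on comparison can be sketched as below. The text does not state whether temperature and humidity have separate thresholds or whether both must be exceeded, so this sketch assumes one threshold per quantity and triggers re-learning when either is met. The names `Environment` and `needs_relearning` and the default threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """One reading from the work environment detection unit 19 (assumed fields)."""
    temperature_c: float
    humidity_pct: float


def needs_relearning(at_power_off: Environment,
                     at_power_on: Environment,
                     temp_threshold_c: float = 5.0,       # illustrative value
                     humidity_threshold_pct: float = 20.0  # illustrative value
                     ) -> bool:
    """Return True if either difference is at or above its threshold."""
    temp_diff = abs(at_power_on.temperature_c - at_power_off.temperature_c)
    humidity_diff = abs(at_power_on.humidity_pct - at_power_off.humidity_pct)
    return (temp_diff >= temp_threshold_c
            or humidity_diff >= humidity_threshold_pct)
```

The comparison uses "greater than or equal to", matching the wording "equal to or greater than a predetermined threshold".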

Alternatively, a reward table may be acquired for each work environment and stored in the reward table storage unit 241, and the decision-making unit 214 may read the reward table corresponding to the current work environment from the reward table storage unit 241 and select the action to be performed by the robot arm 11.
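
One possible way to realize this per-environment lookup is sketched below. How a work environment is matched to a stored table is not specified in the text; this sketch assumes each table is keyed by the (temperature, humidity) pair under which it was learned and picks the nearest key by a simple sum of absolute differences. The greedy selection of the highest-valued action is likewise an assumption, and `select_reward_table` and `greedy_action` are hypothetical names.

```python
from typing import Dict, Tuple

Env = Tuple[float, float]                       # (temperature in C, humidity in %)
RewardTable = Dict[Tuple[str, str], float]      # (state, action) -> value


def select_reward_table(tables: Dict[Env, RewardTable],
                        current: Env) -> RewardTable:
    """Pick the table learned under the environment closest to `current`."""
    if not tables:
        raise ValueError("no reward tables stored")

    def distance(env: Env) -> float:
        # Crude assumed metric: sum of absolute differences.
        return abs(env[0] - current[0]) + abs(env[1] - current[1])

    return tables[min(tables, key=distance)]


def greedy_action(table: RewardTable, state: str) -> str:
    """Choose the action with the highest value for `state` (assumed policy)."""
    candidates = {a: v for (s, a), v in table.items() if s == state}
    if not candidates:
        raise KeyError(f"no actions recorded for state {state!r}")
    return max(candidates, key=candidates.get)
```

For example, with one table learned at (20.0, 50.0) and another at (35.0, 80.0), a current reading of (33.0, 75.0) would select the second table before the action is chosen.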

In the above embodiments, the robot control device 20 includes the learning device according to the present invention; however, the functions constituting the learning device may instead be provided externally to the robot control device 20. In that case, the robot control device 20 and the learning device each include an interface having a communication module such as a LAN chip, and are connected so that they can exchange data with each other.

The present invention is not limited to the configurations of the above embodiments, and various modifications are possible. The configurations and processes shown in the above embodiments with reference to FIGS. 1 to 5 are merely one embodiment of the present invention and are not intended to limit the present invention to those configurations and processes.

1 Robot control system
10 Robot
11 Robot arm
12 Joint
15 Tip portion
16 Arm drive unit
17 Rotation angle detection unit
20 Robot control device
41 Gripper
211 Control unit
212 State observation unit
213 Learning unit
213A Reward calculation unit
213B Function updating unit
215 Action planning unit
218 Setting reception unit

Claims

A learning device for learning the actions of a robot, the robot comprising:
a robot arm that has a plurality of joints and is freely movable in three-dimensional space;
a joint drive unit that is provided at each of the plurality of joints and drives that joint; and
a state detection unit that detects a state of the robot arm, including the position of the robot arm,
the learning device comprising:
a state observation unit that observes the state of the robot arm based on a detection result from the state detection unit; and
a learning unit that learns the actions of the robot arm by associating with one another, in the action process of the robot arm until it reaches a preset target arrival position, the state of the robot arm at a certain time as observed by the state observation unit, an action of the robot arm from that state, and the state of the robot arm after that action,
wherein the learning unit comprises:
a reward calculation unit that calculates a reward according to the trajectory along which the robot arm has moved and the drive of the joint drive units; and
a function updating unit that updates, based on the reward calculated by the reward calculation unit, an action value function indicating the value of selecting a certain action from a certain state of the robot arm.

The learning device according to claim 1, wherein
the state observation unit observes the position of the tip portion of the robot arm based on the detection result from the state detection unit, and
the reward calculation unit performs the calculation such that the reward increases as the distance between the tip portion of the robot arm and the target arrival position decreases.

The learning device according to claim 1 or claim 2, wherein
the state detection unit detects the torque of each of the joint drive units,
the state observation unit observes the torque of each of the joint drive units based on the detection result from the state detection unit, and
the reward calculation unit performs the calculation such that the reward decreases as the sum of the torques of the joint drive units increases.

The learning device according to claim 1 or claim 2, wherein the reward calculation unit performs the calculation such that the reward decreases as the drive time of the joint drive units increases.

The learning device according to any one of claims 1 to 4, further comprising a setting reception unit that receives user settings of the reward weights for the trajectory along which the robot arm has moved and for the drive of the joint drive units, respectively, wherein
the reward calculation unit calculates the reward according to the user-set weights received by the setting reception unit.

A robot control device comprising:
the learning device according to any one of claims 1 to 5; and
a decision-making unit that selects an action to be performed by the robot arm based on a learning result from the learning device,
wherein the robot control device controls the actions of the robot arm based on the decision made by the decision-making unit.

A robot control system comprising:
the robot control device according to claim 6; and
the robot.