KR102281119B1

KR102281119B1 - Method for controlling 7-axis robot using reinforcement learning

Info

Publication number: KR102281119B1
Application number: KR1020190154831A
Authority: KR
Inventors: 안다운; 이청화
Original assignee: 한국생산기술연구원
Priority date: 2019-11-27
Filing date: 2019-11-27
Publication date: 2021-07-26
Also published as: KR20210065738A

Abstract

An embodiment of the present invention provides a technique for improving the control performance of an articulated robot having seven degrees of freedom by using a reinforcement learning algorithm to control a 7-axis robot. A method for controlling a 7-axis robot using reinforcement learning according to an embodiment of the present invention includes: a setting step of setting the parameters that constitute the reinforcement learning, and setting a state and an action, for a 7-axis robot that includes first to sixth link units connected in sequence, an end effector that is coupled to the sixth link unit and performs a task, and first to seventh servo motors for rotating the respective joints; a reward performing step in which policy evaluation of the position error and the angle error of a reference point of the end effector is performed within a predetermined reference error, and a reward for policy improvement is given to the action with the minimum total error; a policy derivation step of deriving an optimal policy corresponding to the action with the minimum total error; and an operation step in which the 7-axis robot operates by transmitting control signals according to the optimal policy to each of the first to seventh servo motors.

Description

7-axis robot control method using reinforcement learning {METHOD FOR CONTROLLING 7-AXIS ROBOT USING REINFORCEMENT LEARNING}

The present invention relates to a method for controlling a 7-axis robot using reinforcement learning, and more particularly, to a technique for improving the control performance of an articulated robot having seven degrees of freedom by using a reinforcement learning algorithm to control the 7-axis robot.

A main focus of the coming fourth industrial revolution is the innovation of manufacturing through smart factories. Articulated robots are a basic entity of a smart factory, so advances in articulated robot technology can play an important role in realizing smart factories.

Recently, articulated robots have been evolving by combining existing automated control methods with various advanced technologies. In particular, 5G-based connectivity not only links robots to one another but also lets them exchange high-quality information with the people and the environment in a factory and perform real-time optimal control. In addition, deep learning is improving the perception capabilities of robots, and big-data-based failure prediction and diagnosis are steadily improving maintenance systems and decision-making. With these technologies, robots can learn optimal working methods on their own, cooperate to perform sophisticated tasks, or use accumulated information to protect a smart factory from failures and hazards.

To perform the various tasks required for such a smart factory, research and development is under way on methods of controlling an articulated robot having seven degrees of freedom with artificial intelligence algorithms. Because an articulated robot with seven degrees of freedom has a redundant joint, it offers high manipulability and flexibility, but it is also considerably harder to analyze. Specifically, it is very difficult to derive the inverse kinematics equations of a 7-axis robot analytically.

Korean Patent Application Publication No. 10-2019-0040506 (title of the invention: Deep Reinforcement Learning for Robotic Manipulation) discloses a method implemented by one or more processors, the method comprising, during performance of a plurality of episodes by each of a plurality of robots, each episode being an exploration of performing a task based on a policy neural network representing a reinforcement learning policy for the task: storing, in a buffer, instances of robot experience data generated during the episodes by the robots, each instance of the robot experience data being generated during a corresponding one of the episodes and being generated at least in part based on a corresponding output generated using the policy neural network with corresponding policy parameters for the corresponding episode; and iteratively generating updated policy parameters of the policy neural network, each iteration including generating the updated policy parameters using a group of one or more instances of the robot experience data in the buffer during that iteration.

Korean Patent Application Publication No. 10-2019-0040506

An object of the present invention, devised to solve the above problems, is to provide a method capable of controlling a 7-axis robot arm through reinforcement learning without requiring difficult equations.

A further object of the present invention is to provide a control method that minimizes design constraints on a 7-axis robot by avoiding the use of equations that are difficult to analyze.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

To achieve the above objects, the present invention includes: a setting step of setting the parameters that constitute the reinforcement learning, and setting a state and an action, for a 7-axis robot that includes first to sixth link units connected in sequence, an end effector that is coupled to the sixth link unit and performs a task, and first to seventh servo motors for rotating the respective joints; a reward performing step in which policy evaluation of the position error and the angle error of a reference point of the end effector is performed within a predetermined reference error, and a reward for policy improvement is given to the action with the minimum total error; a policy derivation step of deriving an optimal policy corresponding to the action with the minimum total error; and an operation step in which the 7-axis robot operates by transmitting control signals according to the optimal policy to each of the first to seventh servo motors.

In an embodiment of the present invention, in the setting step, the state may be the three-dimensional position coordinates (x, y and z) of the reference point of the end effector.

In an embodiment of the present invention, in the setting step, the action may be the rotation angles (θ₁ to θ₇) of the respective first to seventh servo motors.

In an embodiment of the present invention, the reward performing step may include: a candidate policy calculation step of deriving a candidate policy that replaces the current policy; a candidate state calculation step of deriving, by using the candidate policy, a candidate state that replaces the current state; and a value function derivation step of deriving, by using the candidate state, a state-action value function that is a function of the error of the state of the reference point of the end effector.

In an embodiment of the present invention, the reward performing step may further include a limit determination step of determining whether the limit value (Q_limit) of the state-action value function is greater than or equal to the minimum value (Q_min) of the state-action value function.

In an embodiment of the present invention, the reward performing step may further include a policy improvement step of, after the limit determination step, deriving a next policy and performing an update that assigns the next policy to the current policy.

In an embodiment of the present invention, the 7-axis robot may further include a base whose upper surface is flat.

In an embodiment of the present invention, the first servo motor, the fourth servo motor and the seventh servo motor may generate rotational force about a vertical rotation axis, which is an axis perpendicular to the upper surface of the base.

In an embodiment of the present invention, the second servo motor, the third servo motor, the fifth servo motor and the sixth servo motor may generate rotational force about a horizontal rotation axis, which is an axis perpendicular to the vertical rotation axis.

An effect of the present invention configured as above is that, by controlling the operation of the 7-axis robot with a reinforcement learning algorithm, the 7-axis robot can be controlled without deriving inverse kinematics equations and, furthermore, the control performance of the 7-axis robot can be improved.

A further effect of the present invention is that the range of use of 7-axis robots can be expanded by implementing a control method that minimizes design constraints on the 7-axis robot by avoiding equations that are difficult to analyze, such as inverse kinematics equations.

The effects of the present invention are not limited to those described above and should be understood to include all effects that can be inferred from the configuration of the invention described in the detailed description or the claims.

FIG. 1 is a perspective view of a 7-axis robot according to an embodiment of the present invention.
FIG. 2 is a perspective view of a 7-axis robot according to another embodiment of the present invention.
FIG. 3 is a schematic diagram of the parameter settings of a 7-axis robot according to an embodiment of the present invention.
FIG. 4 is a flowchart of the reward performing step and the policy derivation step according to an embodiment of the present invention.
FIG. 5 is a graph of the results of the policy according to an embodiment of the present invention.
FIG. 6 is an image showing the operation accuracy of the 7-axis robot according to an embodiment of the present invention.
FIG. 7 is a graph of the objective function error during operation of the 7-axis robot according to an embodiment of the present invention.

Hereinafter, the present invention will be described with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and is therefore not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the present invention clearly, and similar reference numerals denote similar parts throughout the specification.

Throughout the specification, when a part is said to be "connected (joined, in contact, coupled)" to another part, this includes not only the case where it is "directly connected" but also the case where it is "indirectly connected" with another member interposed between them. In addition, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

The terms used herein are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a perspective view of a 7-axis robot according to an embodiment of the present invention, FIG. 2 is a perspective view of a 7-axis robot according to another embodiment of the present invention, and FIG. 3 is a schematic diagram of the parameter settings of a 7-axis robot according to an embodiment of the present invention. FIG. 4 is a flowchart of the reward performing step and the policy derivation step according to an embodiment of the present invention.

First, the 7-axis robot used in the 7-axis robot control method of the present invention will be described. The 7-axis robot of the present invention is an articulated robot having seven degrees of freedom (DOF) and may include first to sixth link units (210 to 260) connected in sequence, an end effector (300) that is coupled to the sixth link unit (260) and performs a task, and first to seventh servo motors (110 to 170) for rotating the respective joints. It may further include a base (400) whose upper surface is flat. Here, the first servo motor (110), the fourth servo motor (140) and the seventh servo motor (170) may generate rotational force about a vertical rotation axis, which is an axis perpendicular to the upper surface of the base (400). The second servo motor (120), the third servo motor (130), the fifth servo motor (150) and the sixth servo motor (160) may generate rotational force about a horizontal rotation axis, which is an axis perpendicular to the vertical rotation axis.

Specifically, one end of the first link unit (210) is coupled to the base (400) and is fixedly supported by it, and the first link unit (210) may include the first servo motor (110), which generates rotational force about the vertical rotation axis perpendicular to the upper surface of the base (400). One end of the second link unit (220) is coupled to the other end of the first link unit (210); the second link unit (220) is rotated by the first servo motor (110) and may include the second servo motor (120), which generates rotational force about the horizontal rotation axis perpendicular to the vertical rotation axis.

One end of the third link unit (230) is coupled to the other end of the second link unit (220); the third link unit (230) is rotated by the second servo motor (120) and may include the third servo motor (130), which generates rotational force about the horizontal rotation axis. One end of the fourth link unit (240) may be coupled to the other end of the third link unit (230). The fourth link unit (240) is rotated by the third servo motor (130) and may include, at one end, the fourth servo motor (140), which generates rotational force about the vertical rotation axis and is connected to the other end of the third link unit (230), and, at the other end, the fifth servo motor (150), which generates rotational force about the horizontal rotation axis. Unlike the other link units, which have no bend, the fourth link unit (240) may have a bent portion between its two ends.

One end of the fifth link unit (250) is coupled to the other end of the fourth link unit (240); the fifth link unit (250) is rotated by the fifth servo motor (150) and may include the sixth servo motor (160), which generates rotational force about the horizontal rotation axis. One end of the sixth link unit (260) is coupled to the other end of the fifth link unit (250); the sixth link unit (260) is rotated by the sixth servo motor (160) and may include the seventh servo motor (170), which generates rotational force about the vertical rotation axis. One end of the end effector (300) is coupled to the other end of the sixth link unit (260); the end effector (300) is rotated by the seventh servo motor (170), and the other end of the end effector (300), which is its reference point, can come into contact with the work target. When the end effector (300) has the shape of a pen for drawing figures or characters, as shown in FIGS. 1 and 3, the reference point of the end effector (300) is its other end; when the end effector (300) is a gripper, as shown in FIG. 2, the reference point may be changed. The following description assumes a pen-shaped end effector (300).
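By way of illustration only, the joint layout described above can be written down as a small data structure. The class and field names below are hypothetical; only the reference numerals and the vertical/horizontal axis assignment are taken from the description, and the axis labels refer to the rest pose shown in FIGS. 1 and 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Joint:
    """One revolute joint of the 7-axis robot (hypothetical helper type)."""
    servo: int   # reference numeral of the servo motor in the description
    axis: str    # "vertical" or "horizontal", relative to the base's upper surface


# Axis assignment as stated in the description:
# servo motors 1, 4 and 7 rotate about the vertical axis,
# servo motors 2, 3, 5 and 6 rotate about the horizontal axis.
JOINTS = (
    Joint(110, "vertical"),
    Joint(120, "horizontal"),
    Joint(130, "horizontal"),
    Joint(140, "vertical"),
    Joint(150, "horizontal"),
    Joint(160, "horizontal"),
    Joint(170, "vertical"),
)

vertical = [j.servo for j in JOINTS if j.axis == "vertical"]
horizontal = [j.servo for j in JOINTS if j.axis == "horizontal"]
```

Keeping the layout in one table like this makes it easy to check that the seven joints split into three vertical and four horizontal axes.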

In the 7-axis robot control method of the present invention, a setting step may be performed in which the parameters that constitute the reinforcement learning are set, and the state and the action are set, for the 7-axis robot described above. Here, the state may be the three-dimensional position coordinates (x, y and z) of the reference point of the end effector (300), and the action may be the rotation angles (θ₁ to θ₇) of the respective first to seventh servo motors (110 to 170). When the reference point of the end effector (300) touches the paper, which is the work target of the end effector (300) and is placed on the base (400), the z coordinate may be set to 0. When the reference point of the end effector (300) is located at any one vertex of the rectangular paper, the x and y coordinates may each be set to 0. In the rest state shown in FIGS. 1 and 3 (all servo motors at PAUSE), each of the rotation angles (θ₁ to θ₇) may be set to 0.

The change in rotation angle (Δθ₁ to Δθ₇) of each servo motor takes one of three values: clockwise (CW), counterclockwise (CCW), or stop (PAUSE). Accordingly, the total number of possible actions of the seven servo motors is 2,187 (= 3⁷), as summarized in (Table 1) below. However, as shown in (Table 1), the case in which all servo motors are stopped (a₁₀₉₄) may be excluded from the reinforcement learning.

(Table 1) Actions and the rotational variations of the seven servo motors

| a_n | Δθ₁ | Δθ₂ | Δθ₃ | Δθ₄ | Δθ₅ | Δθ₆ | Δθ₇ |
| a₁ | CW | CW | CW | CW | CW | CW | CW |
| a₂ | CW | CW | CW | CW | CW | CW | PAUSE |
| a₃ | CW | CW | CW | CW | CW | CW | CCW |
| ... | ... | ... | ... | ... | ... | ... | ... |
| a₁₀₉₃ | PAUSE | PAUSE | PAUSE | PAUSE | PAUSE | PAUSE | CW |
| a₁₀₉₄ (N/A) | PAUSE | PAUSE | PAUSE | PAUSE | PAUSE | PAUSE | PAUSE |
| a₁₀₉₅ | PAUSE | PAUSE | PAUSE | PAUSE | PAUSE | PAUSE | CCW |
| ... | ... | ... | ... | ... | ... | ... | ... |
| a₂₁₈₅ | CCW | CCW | CCW | CCW | CCW | CCW | CW |
| a₂₁₈₆ | CCW | CCW | CCW | CCW | CCW | CCW | PAUSE |
| a₂₁₈₇ | CCW | CCW | CCW | CCW | CCW | CCW | CCW |
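As a minimal sketch of how the action set of (Table 1) could be enumerated, the following Python snippet builds all 3⁷ combinations in the same order as the table (the seventh servo motor varies fastest) and then drops the all-PAUSE action. The variable names, the `to_deltas` helper, the sign convention (CW negative, CCW positive) and the step size are assumptions made for illustration; they are not specified in the description.

```python
from itertools import product

# Per-servo choices, in the order used by Table 1.
MOVES = ("CW", "PAUSE", "CCW")

# All 3**7 combinations; product() varies the last servo fastest,
# which matches the row order a_1, a_2, a_3, ... of Table 1.
all_actions = list(product(MOVES, repeat=7))

# The all-PAUSE action (a_1094 in the table's 1-based numbering) is excluded.
idle = ("PAUSE",) * 7
actions = [a for a in all_actions if a != idle]


def to_deltas(action, step_deg=1.0):
    """Map a symbolic action to signed angle increments in degrees.

    The sign convention and the step size are illustrative assumptions.
    """
    sign = {"CW": -1.0, "PAUSE": 0.0, "CCW": 1.0}
    return [sign[move] * step_deg for move in action]
```

With this ordering, the zero-based index of the excluded action is 1093, which corresponds to a₁₀₉₄ in the table, and 2,186 actions remain.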

After the setting step, a reward performing step may be performed, in which policy evaluation of the position error and the angle error of the reference point of the end effector (300) is performed within a predetermined reference error, and a reward for policy improvement is given to the action with the minimum total error. After the reward performing step, a policy derivation step of deriving an optimal policy corresponding to the action with the minimum total error may be performed. The reward performing step and the policy derivation step may be carried out through policy iteration, which is performed repeatedly as shown in FIG. 4. Policy iteration is described below.

First, in step S100, the starting step, the initial state (s₀) and the initial policy (π₀) for the parameters set in the setting step may be input. Then, in step S210, the current state (s) may be determined based on the initial state and the initial action input in step S100. Next, in step S220, the current policy (π) may be determined.

Next, in step S230, a candidate policy calculation step of deriving a candidate policy that replaces the current policy may be performed. In each episode, the 7-axis robot of the present invention moves from the initial state (s₀) and the initial policy (π₀) to the designated episode state (s^e), thereby moving the end effector (300) from its initial position (or initial policy) to the required pose. Here, the index e denotes the episode number, and in policy iteration the current policy (π) may be updated according to [Equation 1] and [Equation 2] below.

[Equation 1]

[Equation 2]

Here, the candidate policy for the next iteration, obtained according to the actions, is a 1×2186 matrix (the action a₁₀₉₄ is excluded). a is the set of actions and is also a 1×2186 matrix. The learning rate ρ determines how much of the current policy is updated into the next policy according to [Equation 3] below, and the initial learning rate (ρ₀) may be 10. This value is determined experimentally and is not limited thereto.

[Equation 3]

Here, the proportionality constant c is 6; this value may also be determined experimentally. If the learning rate (ρ) falls outside the range of 5 to 10, it may be replaced with 5 (when below the range) or 10 (when above the range).
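The range restriction on the learning rate amounts to clamping. The sketch below shows only that clamping; it does not reproduce [Equation 3] itself, and the function and constant names are hypothetical.

```python
# Bounds on the learning rate stated in the description.
RHO_MIN = 5.0
RHO_MAX = 10.0


def clamp_learning_rate(rho):
    """Return rho limited to [RHO_MIN, RHO_MAX].

    A value below 5 is replaced with 5 and a value above 10 with 10;
    a value already inside the range is returned unchanged.
    """
    return max(RHO_MIN, min(RHO_MAX, rho))
```

In an implementation, a function like this would be applied to whatever value the learning rate update produces before it is used in the next iteration.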

Then, in step S240, a candidate state calculation step may be performed, in which a candidate state that replaces the current state is derived using the candidate policy. The replacement values of the learning rate (ρ) described above may also be determined empirically. The candidate state can be derived by substituting the candidate policy into [Equation 4] and [Equation 5], which use the forward kinematics equation (f_FK). The forward kinematics equation may be a function stored in the reinforcement learning program. Before the reinforcement learning program is created, the forward kinematics of the 7-axis robot of the present invention is derived using the Denavit-Hartenberg convention, and the result is stored in the program and used as a function. Because this relies on an existing formula, a detailed description is omitted.
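As an illustration of how a forward kinematics function of this kind might be stored and called, the following sketch chains standard Denavit-Hartenberg transforms. The DH rows are placeholders (a planar chain of unit links), not the parameters of the robot in this invention, and for brevity the function returns only the position (x, y, z) of the end effector.

```python
import math


def dh_transform(theta, d, a, alpha):
    """Homogeneous transform of one joint in the standard DH convention."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul4(A, B):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def forward_kinematics(thetas_deg, dh_rows):
    """Chain the per-joint transforms and return the end-effector (x, y, z)."""
    T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for theta_deg, (d, a, alpha) in zip(thetas_deg, dh_rows):
        T = matmul4(T, dh_transform(math.radians(theta_deg), d, a, alpha))
    return T[0][3], T[1][3], T[2][3]


# Hypothetical DH rows (d, a, alpha): seven unit links in one plane.
PLACEHOLDER_DH = [(0.0, 1.0, 0.0)] * 7
```

With the placeholder rows, all joint angles at zero put the end effector at approximately (7, 0, 0), and rotating only the first joint by 90° moves it to approximately (0, 7, 0).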

[Equation 4]

[Equation 5]

Here, the candidate state may be a 1x2186 matrix.

Thereafter, in step S250, a value function deriving step may be performed, in which a state-action value function, a function of the error of the reference-point state of the end effector 300, is derived using the candidate state. In the policy evaluation of steps S230 to S260, the state-action value function, denoted Q, can be calculated for the candidate actions and all policies associated with them. Most state-action value functions use the Bellman equation, which computes the expectation of all future rewards. In practice, however, because the errors in x, y, z, α, β, and γ are too numerous, a modified state-action value function may be used. The state-action value function (Q) may be defined as the state error of the end effector 300, expressed by [Equation 6] and [Equation 7] below.

[Equation 6]

[Equation 7]

Here, the modified state-action value function Q may be a 1x2186 matrix. f_PE is the policy evaluation function and can be expressed by [Equation 8] below.

[Equation 8]

S_P and S_O may be the position and orientation sub-states shown in [Equation 9] below. k₁ and k₂ are calculated by [Equation 10] below and may have values of 4.31 and 2.43, respectively.

[Equation 9]

[Equation 10]

Here, defining limit values for all parameters (x_limit, y_limit, z_limit, α_limit, β_limit, and γ_limit) provides reliability with respect to errors and reproducibility of control by reinforcement learning. Specifically, the resolution of each servomotor, 0.29 degrees/pulse, may be used as the reference parameter for the reliability limit in joint space (the coupling space between the link units). The reliability limit in Cartesian space may be calculated from the forward kinematics equation (f_FK) described above when the rotation angles of the servomotors (θ₁ to θ₇) are 0°, 90°, -90°, 0°, 0°, 0°, and 0°. Here, the following may be used: all actions, a learning rate value of 0.29, an x_limit value of 2.3954, a y_limit value of 0.5475, a z_limit value of 0.8205, an α_limit value of 0.0088, a β_limit value of 1.1600, and a γ_limit value of 0.8701.

In addition, in step S260, a limit determination step may be performed to determine whether the limit value (Q_limit) of the state-action value function is greater than or equal to the minimum value (Q_min) of the state-action value function. The limit value (Q_limit) is the stopping threshold of the policy iteration and can be calculated by [Equation 11] below.

[Equation 11]

If it is determined in step S260 that the limit value (Q_limit), the stopping threshold, is less than the minimum value (Q_min) of the state-action value function, the process moves to step S270, where a policy improvement step may be performed: the next policy (π') is derived and substituted for the current policy of step S220. Here, the next policy (π') can be calculated by [Equation 12] below. As shown in [Equation 12], the action with the minimum value (Q_min) of the state-action value function may be selected as the action for the next iteration. In other words, the update of step S270 is significant in that only one action affects the next policy (π').

[Equation 12]

If it is determined in step S260 that the limit value (Q_limit), the stopping threshold, is greater than or equal to the minimum value (Q_min) of the state-action value function, then in step S300 a policy deriving step may be performed, in which the optimal policy (π*) corresponding to the action with the minimum total error is derived.
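The branch taken at step S260 can be sketched as follows. The function name and the list representation of Q are assumptions; Q_limit is passed in as a given number because [Equation 11] is not reproduced here.

```python
def limit_check(q_values, q_limit):
    """Return (done, best_index) for one pass through step S260.

    q_values   : one state-action value (error) per candidate action
    done       : True when Q_limit >= Q_min, i.e. proceed to step S300
    best_index : index of the action with the minimum value Q_min
    """
    best_index = min(range(len(q_values)), key=q_values.__getitem__)
    q_min = q_values[best_index]
    return q_limit >= q_min, best_index
```

When done is False, the action at best_index would be the single action used to form the next policy in step S270; when it is True, the same action corresponds to the optimal policy of step S300.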

After step S300, the process may move to the next episode with a reward of +1, and the maximum number of policy iterations per episode (i_MAX) may be set to 50, although it is not limited thereto. If an episode does not find the optimal policy within the maximum number of policy iterations (i_MAX), the reinforcement learning algorithm may adjust the hyperparameters and retrain. Policy iteration may be repeated until all required episodes are complete.
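The per-episode loop described above might be organized as in the following sketch. The callables evaluate and improve stand in for the policy evaluation and policy improvement steps and are hypothetical, as is the reward of 0 returned when the iteration cap is reached; the description only specifies the +1 reward and the cap of 50.

```python
I_MAX = 50  # maximum policy iterations per episode, per the description


def run_episode(policy, evaluate, improve, q_limit, i_max=I_MAX):
    """Repeat evaluation and improvement until Q_limit >= Q_min.

    evaluate(policy)            -> list of Q values, one per candidate action
    improve(policy, best_index) -> policy after applying the chosen action
    Returns (policy, iterations, reward).
    """
    for i in range(1, i_max + 1):
        q_values = evaluate(policy)
        best = min(range(len(q_values)), key=q_values.__getitem__)
        policy = improve(policy, best)  # adopt the minimum-error action
        if q_limit >= q_values[best]:
            return policy, i, 1  # optimal policy found, reward +1
    return policy, i_max, 0  # caller may adjust hyperparameters and retrain


# Toy 1-D stand-in (hypothetical): step one angle toward a target of 3.0.
STEP, TARGET = 0.5, 3.0
moves = (-1, +1)
evaluate = lambda p: [abs(p + m * STEP - TARGET) for m in moves]
improve = lambda p, k: p + moves[k] * STEP
print(run_episode(0.0, evaluate, improve, q_limit=0.25))
```

In the toy run the loop reaches the target on the sixth iteration and prints (3.0, 6, 1). Passing the two steps in as callables keeps the loop independent of how Q is actually computed.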

That is, after the policy deriving step (S300), in step S410 an operation step may be performed in which a control signal according to the optimal policy is transmitted to each of the first to seventh servomotors 170 to operate the 7-axis robot, and at the same time it may be determined whether the last state has been reached. If the episode is determined to have ended in the last state, the process moves to step S500 and the policy iteration may end. If it is determined that the last state has not been reached, the process moves to step S420, the next state (s') is substituted for the current state of step S220, and the policy iteration may be performed again.

For practical control of the 7-axis robot as described above, a computer-readable recording medium on which a program for executing the 7-axis robot control method of the present invention is recorded may be produced.

As described above, by controlling the 7-axis robot with a reinforcement learning algorithm, the robot can be controlled without deriving its inverse kinematics equations, and the control performance for the 7-axis robot can furthermore be improved.

FIG. 5 is a graph of policy results according to an embodiment of the present invention. FIG. 5(a) shows, for each state, the optimal policy, that is, the optimal rotation angle (degrees), of each of the first servomotor 110 to the fourth servomotor 140, and FIG. 5(b) shows the optimal policy, that is, the optimal rotation angle (degrees), of each of the fifth servomotor 150 to the seventh servomotor 170. FIG. 6 is an image showing the operating accuracy of the 7-axis robot according to an embodiment of the present invention, and FIG. 7 is a graph of the objective function error during operation of the 7-axis robot according to an embodiment of the present invention.

In FIG. 6, the 121 (11x11) rhombus marks indicate the desired states, and the dot drawn on each mark may be the result of the optimal policy. In FIG. 7, the E_ref line denotes the error reference value, and the E line denotes the objective function error value.

As shown in FIG. 5, the optimal policy differs for each servomotor, and, as shown in FIG. 6, the optimal policy is carried out for each state. That is, in this embodiment an episode of the 7-axis robot consists of placing a dot on each predetermined rhombus mark (state), and FIG. 6 confirms that these episodes are performed with optimal actions. Furthermore, as shown in FIG. 7, the operating error of the 7-axis robot under the optimal policy stays below the error reference value throughout, confirming that the operating performance of the 7-axis robot is improved.

The foregoing description of the present invention is provided for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present invention is defined by the claims below, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

110: first servomotor 120: second servomotor
130: third servomotor 140: fourth servomotor
150: fifth servomotor 160: sixth servomotor
170: seventh servomotor 210: first link unit
220: second link unit 230: third link unit
240: fourth link unit 250: fifth link unit
260: sixth link unit 300: end effector
400: base

Claims

1. A 7-axis robot control method using reinforcement learning, comprising:
a setting step of setting parameters constituting reinforcement learning, and setting a state and an action, for a 7-axis robot including first to sixth link units connected sequentially, an end effector coupled to the sixth link unit and performing a task, and first to seventh servomotors for rotating the respective joints;
a reward performing step in which policy evaluation of the position error and the angle error of a reference point of the end effector is performed within a predetermined reference error, and a reward of policy improvement is given for the action with the minimum total error;
a policy deriving step of deriving an optimal policy corresponding to the action with the minimum total error; and
an operation step in which the 7-axis robot operates by transmitting a control signal according to the optimal policy to each of the first to seventh servomotors,
wherein the reward performing step includes a candidate policy calculation step of deriving a candidate policy that replaces the current policy, and
wherein, in the candidate policy calculation step, the current policy (π) is updated in the policy iteration by [Equation 1] and [Equation 2] below.
[Equation 1]

[Equation 2]

Here, the candidate policy for the next iteration, which depends on the action, is a 1x2186 matrix (excluding the entry with subscript 1094, the case in which all servomotors are stopped). a is the set of actions and is a 1x2186 matrix. The learning rate ρ is a value governing the update of the current policy to the next policy by [Equation 3] below, and the initial learning rate (ρ₀) is an arbitrary constant.
[Equation 3]

Here, c is an arbitrary constant, and Q_min is the minimum value of the state-action value function and is an arbitrarily set value.

2. The method according to claim 1,
wherein, in the setting step, the state is the three-dimensional position coordinates (x, y, and z) of the reference point of the end effector.

3. The method according to claim 1,
wherein, in the setting step, the action is the rotation angle (θ₁ to θ₇) of each of the first to seventh servomotors.

4. The method according to claim 1,
wherein the reward performing step further includes:
a candidate state calculation step of deriving a candidate state that replaces the current state by using the candidate policy; and
a value function deriving step of deriving, by using the candidate state, a state-action value function that is a function of the error of the reference-point state of the end effector.

5. The method according to claim 4,
wherein the reward performing step further includes a limit determination step of determining whether the limit value (Q_limit) of the state-action value function is greater than or equal to the minimum value (Q_min) of the state-action value function.

6. The method according to claim 5,
wherein the reward performing step further includes a policy improvement step of, after the limit determination step, deriving a next policy and performing an update that substitutes the next policy for the current policy.

7. The method according to claim 1,
wherein the 7-axis robot further includes a base whose upper surface is formed as a plane.

8. The method according to claim 7,
wherein the first servomotor, the fourth servomotor, and the seventh servomotor generate rotational force about a vertical rotation axis, which is an axis perpendicular to the upper surface of the base.

9. The method according to claim 8,
wherein the second servomotor, the third servomotor, the fifth servomotor, and the sixth servomotor generate rotational force about a horizontal rotation axis, which is an axis perpendicular to the vertical rotation axis.

10. A computer-readable recording medium on which is recorded a program for executing the 7-axis robot control method using reinforcement learning according to any one of claims 1 to 9.