WO2020134254A1 - Method employing reinforcement learning to optimize trajectory of spray painting robot - Google Patents

Method employing reinforcement learning to optimize trajectory of spray painting robot

Info

Publication number
WO2020134254A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
trajectory
target
reinforcement learning
optimal
Prior art date
Application number
PCT/CN2019/108333
Other languages
French (fr)
Chinese (zh)
Inventor
智鹏飞
宦键
王馨馨
陈伟
王伟然
刘俊杰
刘浩
Original Assignee
南京芊玥机器人科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京芊玥机器人科技有限公司
Publication of WO2020134254A1 publication Critical patent/WO2020134254A1/en

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0231Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0246Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
    • G05D1/0251Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means extracting 3D information from a plurality of images taken from different locations, e.g. stereo vision
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0276Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • the invention belongs to the field of intelligent algorithm control, and particularly relates to a trajectory optimization method of a spraying robot based on reinforcement learning.
  • Reinforcement learning-based trajectory optimization uses reinforcement learning to let the robot model its environment, reproducing the same or similar conditions inside the machine so that the spraying trajectory can be optimized.
  • Existing trajectory optimization algorithms are usually the probabilistic roadmap method, the fast search tree method and the artificial potential field method.
  • When such methods model the environment or simulate the workspace, they must sample many postures of the robotic arm and check whether the current motion is reasonable through the kinematic equations, which is computationally expensive; when the obstacles or target positions in the environment change, the mapping to configuration space must be recomputed for the new environment, so dynamic real-time planning is difficult to achieve.
  • In addition, such methods require an accurate physical model of the robotic arm to be built in advance, and modeling errors directly affect the control of the arm.
  • The reinforcement learning method describes the problem as a Markov decision process, accumulates experience through interaction between the agent and the environment, and continually updates the agent's policy so that its decisions receive higher rewards.
  • this method does not need to build a physical model of the robotic arm.
  • After training, the policy can directly output the next decision from the current environmental state; since a single forward pass of the network is computationally cheap, online planning and real-time control of the spray gun can be realized.
  • The technical problem to be solved by the present invention is to provide a reinforcement-learning-based trajectory optimization method for a spraying robot that enables effective online planning and real-time control of the spray gun during the spraying process.
  • To solve this problem, the present invention provides a reinforcement-learning-based trajectory optimization method for spraying robots, characterized in that the trajectory optimization method determines the optimal spraying trajectory through image acquisition, image processing, three-dimensional reconstruction, discretization, selection of an initial point and selection of the optimal trajectory strategy, wherein the image processing includes camera calibration, image correction and stereo matching; the method includes:
  • Step 1 Image acquisition: Two identical cameras in the binocular vision system are used to simultaneously acquire target object images in the same scene, and then, the acquired images are preprocessed;
  • Step 2 Camera calibration: using the obtained distortion vector, remove the lens distortion in the radial and tangential directions to obtain undistorted images, and at the same time establish the imaging model of the cameras to determine the correspondence between target points and pixel points;
  • Step 3 Image correction: through a linear transformation, the conjugate epipolar lines of the processed undistorted images are brought onto the same horizontal line, reducing the matching problem in two-dimensional space to a search problem in one-dimensional space;
  • Step 4 Stereo matching: a stereo matching algorithm computes the disparity values between the rectified images and the original images; these disparity values yield a sparse matching between each pair of images, and an optimization algorithm then produces a dense matching;
  • Step 5 3D reconstruction: the depth values of the stereo-matched images are computed by the principle of triangulation to obtain a dense three-dimensional point cloud; gridding and interpolation of this point cloud then yield the three-dimensional structure model of the object;
  • Step 6 Discretization: the target object acquired by the binocular vision system is discretized in space, i.e. the target is converted into a set of points;
  • Step 7 Pick the initial point
  • Step 8 Select the optimal trajectory strategy: the decision policy π is defined as a function that maps a state to an action, s→a, i.e. action a is executed in the current state s; the objective function of reinforcement learning is given by formula (1).
  • The reinforcement-learning-based trajectory optimization method of the present invention uses binocular vision to obtain the spatial position information of the target with high accuracy; regions of large curvature and concave or convex points on sprayed workpieces are recognized well, and the efficiency is higher.
  • The reinforcement-learning-based trajectory optimization method of the present invention does not require a large amount of training data, can generate deterministic policies, has strong anti-interference ability, and can perform trajectory optimization effectively.
  • FIG. 1 is a flowchart of a trajectory optimization method of a painting robot based on reinforcement learning of the present invention.
  • Figure 2 is a diagram of a binocular vision system.
  • Figure 3 is a theoretical model diagram of convergent binocular vision.
  • Fig. 4 is a composition diagram of a binocular vision 3D reconstruction system.
  • Figure 5 is a schematic diagram of a reinforcement learning model.
  • In the binocular vision system, the left and right cameras may first be placed in parallel while ensuring that the baseline is not too long; the left and right cameras of this parallel-optical-axis binocular vision system are then rotated about their optical centers clockwise and counterclockwise, respectively, by a certain angle to form a convergent binocular vision system. The advantage of this system is a larger field of view, which improves the accuracy of the computed parallax and therefore the accuracy of the three-dimensional reconstruction.
  • As shown in Figure 1, the reinforcement-learning-based trajectory optimization method of this embodiment determines the optimal spraying trajectory through image acquisition, image processing, three-dimensional reconstruction, discretization, selection of an initial point and selection of the optimal trajectory strategy, where the image processing, as shown in Figure 4, includes camera calibration, image correction and stereo matching; the specific steps are as follows:
  • Step 1 Image acquisition: Two identical cameras in the binocular vision system are used to simultaneously acquire target object images in the same scene, and then, the acquired images are preprocessed;
  • Step 2 Camera calibration: using the obtained distortion vector, remove the lens distortion in the radial and tangential directions to obtain undistorted images, and at the same time establish the imaging model of the cameras to determine the correspondence between target points and pixel points. As shown in Fig. 3, assume that the three-dimensional coordinates of the target point P are (Xw, Yw, Zw); the coordinate system of the left camera is O1-X1Y1Z1 and its image coordinate system is o1-x1y1; the coordinate system of the right camera is O2-X2Y2Z2 and its image coordinate system is o2-x2y2; the origins O1 and O2 are the optical centers of the left and right cameras; the focal lengths of the left and right cameras are f1 and f2 respectively.
  • Step 3 Image correction: through a linear transformation, the conjugate epipolar lines of the processed undistorted images are brought onto the same horizontal line, reducing the matching problem in two-dimensional space to a search problem in one-dimensional space;
  • Step 4 Stereo matching: a stereo matching algorithm computes the disparity values between the rectified images and the original images; these disparity values yield a sparse matching between each pair of images, and an optimization algorithm then produces a dense matching;
  • Step 5 3D reconstruction: the depth values of the stereo-matched images are computed by the principle of triangulation to obtain a dense three-dimensional point cloud; gridding and interpolation of this point cloud then yield the three-dimensional structure model of the object, where r1, r2, …, r9 are the rotation components, tx, ty and tz are the translation components, and Xw, Yw and Zw are the solved three-dimensional coordinates of the three-dimensional structure model;
  • Step 6 Discretization: the target object acquired by the binocular vision system is discretized in space, i.e. the target is converted into a set of points;
  • Step 7 Pick the initial point
  • Step 8 Select the optimal trajectory strategy: the decision policy π is defined as a function that maps a state to an action, s→a, i.e. action a is executed in the current state s; the objective function of reinforcement learning is given by formula (1).
  • On the basis of Q-learning, deep Q-learning fits the Q function Q(s, a) with a neural network whose input state s can be a continuous variable; the learning algorithm is as follows:
  • In step (7) of the learning algorithm: if s_{t+1} is not the final state s_T, return to step (4); if s_{t+1} is the final state s_T, update the target network Q′ so that the network parameters of Q′ equal those of Q, and return to step (3).
  • The reward function r_t in step (5) consists of three parts: the first part penalizes repeated spraying, i.e. remaining stationary; the second part rewards the spray gun for reaching the target position; and the third part rewards or penalizes according to the distance between the spray gun and the target; the reward function is:
  • where x_d is the coordinate of the current point and x_e is the coordinate of the target point.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Electromagnetism (AREA)
  • Manipulator (AREA)
  • Application Of Or Painting With Fluid Materials (AREA)

Abstract

A method employing reinforcement learning to optimize a trajectory of a spray painting robot. The method utilizes image acquisition, image processing, three-dimensional image reconstruction, image discretization, the selection of an initial point, and the selection of the optimal trajectory strategy so as to determine the optimal spray painting trajectory, wherein image processing comprises camera calibration, image correction, and stereo matching. The invention is capable of planning feasible routes according to different conditions in different environments, and has the advantages of short decision-making time, high success rates, and meeting real-time requirements of online planning, thereby remedying the defects of poor real-time performance and a large calculation amount in conventional route planning methods for mechanical arms.

Description

Method for trajectory optimization of a spraying robot based on reinforcement learning
Technical Field
The invention belongs to the field of intelligent algorithm control, and particularly relates to a trajectory optimization method for a spraying robot based on reinforcement learning.
Background Art
Most spraying robots use teach-based spraying and follow trajectories preset by technical staff. This approach sprays effectively on the basis of earlier operators' experience, but the up-front programming effort is large, the trajectories are barely optimized, efficiency is low, and paint waste is relatively high.
Reinforcement learning-based trajectory optimization uses reinforcement learning to let the robot model its environment, reproducing the same or similar conditions inside the machine so that the spraying trajectory can be optimized. Existing trajectory optimization algorithms are usually the probabilistic roadmap method, the fast search tree method and the artificial potential field method. When such methods model the environment or simulate the workspace, they must sample many postures of the robotic arm and check whether the current motion is reasonable through the kinematic equations, which is computationally expensive; when the obstacles or target positions in the environment change, the mapping to configuration space must be recomputed for the new environment, so dynamic real-time planning is difficult to achieve. In addition, such methods require an accurate physical model of the robotic arm to be built in advance, and modeling errors directly affect the control of the arm.
As for current reinforcement learning methods: reinforcement learning describes the problem as a Markov decision process, accumulates experience through interaction between the agent and the environment, and continually updates the agent's policy so that its decisions receive higher rewards. Compared with traditional methods, this approach does not need a physical model of the robotic arm; after training, the policy directly outputs the next decision from the current environmental state, and since a single forward pass of the network is computationally cheap, online planning and real-time control of the spray gun can be realized.
Summary of the Invention
The technical problem to be solved by the present invention is to provide a reinforcement-learning-based trajectory optimization method for a spraying robot that enables effective online planning and real-time control of the spray gun during the spraying process.
To solve the above technical problem, the present invention provides a reinforcement-learning-based trajectory optimization method for spraying robots, characterized in that the trajectory optimization method determines the optimal spraying trajectory through image acquisition, image processing, three-dimensional reconstruction, discretization, selection of an initial point and selection of the optimal trajectory strategy, wherein the image processing includes camera calibration, image correction and stereo matching; the method includes:
Step 1: Image acquisition: two identical cameras of the binocular vision system simultaneously capture images of the target object in the same scene, and the acquired images are then preprocessed;
Step 2: Camera calibration: using the obtained distortion vector, remove the lens distortion in the radial and tangential directions to obtain undistorted images, and at the same time establish the imaging model of the cameras to determine the correspondence between target points and pixel points;
Step 3: Image correction: through a linear transformation, the conjugate epipolar lines of the processed undistorted images are brought onto the same horizontal line, reducing the matching problem in two-dimensional space to a search problem in one-dimensional space;
Step 4: Stereo matching: a stereo matching algorithm computes the disparity values between the rectified images and the original images; these disparity values yield a sparse matching between each pair of images, and an optimization algorithm then produces a dense matching;
Step 5: Three-dimensional reconstruction: the depth values of the stereo-matched images are computed by the principle of triangulation to obtain a dense three-dimensional point cloud; gridding and interpolation of this point cloud then yield the three-dimensional structure model of the object;
Step 6: Discretization: the target object acquired by the binocular vision system is discretized in space, i.e. the target is converted into a set of points;
Step 7: Select the initial point;
Step 8: Select the optimal trajectory strategy: the decision policy π is defined as a function that maps a state to an action, s→a, i.e. action a is executed in the current state s; the objective function of reinforcement learning is:
Q^π(s, a) = E[R(s_0) + γR(s_1) + γ²R(s_2) + … | s_0 = s, a_0 = a, π]    (1)
Q^π(s, a) denotes the expected return obtained by taking action a in the current initial state s and then following the decisions of policy π; the goal is to find the optimal policy π* such that π* = arg max Q^π(s, a).
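For reference, the objective above and the optimal policy it defines can also be written in the standard recursive (Bellman) form that the deep Q-learning algorithm in the embodiment relies on; this restatement is standard reinforcement-learning background, not text taken from the original document:

```latex
Q^{\pi}(s,a)=\mathbb{E}\Big[\textstyle\sum_{t\ge 0}\gamma^{t}R(s_t)\,\Big|\,s_0=s,\;a_0=a,\;\pi\Big],\qquad
Q^{*}(s,a)=\mathbb{E}\big[R(s_0)+\gamma\max_{a'}Q^{*}(s_1,a')\,\big|\,s_0=s,\;a_0=a\big],\qquad
\pi^{*}(s)=\arg\max_{a}Q^{*}(s,a).
```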
The advantages of the present invention are:
(1) The reinforcement-learning-based trajectory optimization method of the present invention uses binocular vision to obtain the spatial position information of the target with high accuracy; regions of large curvature and concave or convex points on sprayed workpieces are recognized well, and the efficiency is higher;
(2) The method can plan feasible paths in different environments according to different states, with short decision time and a high success rate, and meets the real-time requirements of online planning, thereby overcoming the poor real-time performance and heavy computational load of traditional robotic-arm path planning methods;
(3) Reinforcement learning does not require a large amount of training data, can generate deterministic policies, has strong anti-interference ability, and can perform trajectory optimization effectively.
Brief Description of the Drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments.
FIG. 1 is a flowchart of the reinforcement-learning-based trajectory optimization method for a spraying robot of the present invention.
FIG. 2 is a diagram of the binocular vision system.
FIG. 3 is a theoretical model diagram of convergent binocular vision.
FIG. 4 is a composition diagram of the binocular-vision three-dimensional reconstruction system.
FIG. 5 is a schematic diagram of the reinforcement learning model.
Detailed Description
The following embodiments enable those skilled in the art to understand the present invention more fully, but do not limit the present invention to the scope of the described embodiments.
Embodiment
In the binocular vision system of the reinforcement-learning-based trajectory optimization method of this embodiment, as shown in Figure 2, the left and right cameras may first be placed in parallel while ensuring that the baseline is not too long; the left and right cameras of this parallel-optical-axis binocular vision system are then rotated about their optical centers clockwise and counterclockwise, respectively, by a certain angle to form a convergent binocular vision system. The advantage of this system is a larger field of view, which improves the accuracy of the computed parallax and therefore the accuracy of the three-dimensional reconstruction.
As shown in Figure 1, the reinforcement-learning-based trajectory optimization method of this embodiment determines the optimal spraying trajectory through image acquisition, image processing, three-dimensional reconstruction, discretization, selection of an initial point and selection of the optimal trajectory strategy, where the image processing, as shown in Figure 4, includes camera calibration, image correction and stereo matching; the specific steps are as follows:
Step 1: Image acquisition: two identical cameras of the binocular vision system simultaneously capture images of the target object in the same scene, and the acquired images are then preprocessed;
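A minimal sketch of Step 1 in Python with OpenCV, assuming the two identical cameras enumerate as device indices 0 and 1 and that the preprocessing is simple grayscale conversion plus Gaussian smoothing (the original text does not specify which preprocessing is used):

```python
import cv2

def acquire_pair(left_id=0, right_id=1):
    """Grab one image pair from the two cameras of the binocular system and preprocess it."""
    cap_l, cap_r = cv2.VideoCapture(left_id), cv2.VideoCapture(right_id)
    ok_l, img_l = cap_l.read()
    ok_r, img_r = cap_r.read()
    cap_l.release(); cap_r.release()
    if not (ok_l and ok_r):
        raise RuntimeError("camera read failed")
    # Preprocessing (assumed): convert to grayscale and smooth to reduce sensor noise.
    pre = lambda im: cv2.GaussianBlur(cv2.cvtColor(im, cv2.COLOR_BGR2GRAY), (5, 5), 0)
    return pre(img_l), pre(img_r)
```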
Step 2: Camera calibration: using the obtained distortion vector, remove the lens distortion in the radial and tangential directions to obtain undistorted images, and at the same time establish the imaging model of the cameras to determine the correspondence between target points and pixel points. As shown in Fig. 3, assume that the three-dimensional coordinates of the target point P are (Xw, Yw, Zw); the coordinate system of the left camera is O1-X1Y1Z1 and its image coordinate system is o1-x1y1; the coordinate system of the right camera is O2-X2Y2Z2 and its image coordinate system is o2-x2y2; the origins O1 and O2 are the optical centers of the left and right cameras; the focal lengths of the left and right cameras are f1 and f2 respectively;
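The undistortion part of Step 2 can be sketched as follows, assuming the intrinsic matrix and distortion vector of each camera have already been estimated by a prior calibration; the numeric values below are placeholders, not parameters from the patent:

```python
import numpy as np
import cv2

# Placeholder left-camera intrinsics (focal length f1 on the diagonal) and distortion
# coefficients [k1, k2, p1, p2, k3] covering the radial and tangential terms.
K1 = np.array([[800.0,   0.0, 320.0],
               [  0.0, 800.0, 240.0],
               [  0.0,   0.0,   1.0]])
dist1 = np.array([-0.12, 0.05, 0.001, 0.001, 0.0])

def undistort_left(img_l):
    """Remove radial and tangential lens distortion from the left image."""
    return cv2.undistort(img_l, K1, dist1)
```

The right camera is handled the same way with its own intrinsics K2 and distortion vector dist2.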
Step 3: Image correction: through a linear transformation, the conjugate epipolar lines of the processed undistorted images are brought onto the same horizontal line, reducing the matching problem in two-dimensional space to a search problem in one-dimensional space;
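Step 3 corresponds to stereo rectification; a sketch using OpenCV is shown below, where K1, dist1, K2, dist2 and the rotation R and translation T between the two cameras are assumed to come from the calibration of Step 2:

```python
import cv2

def rectify_pair(img_l, img_r, K1, dist1, K2, dist2, R, T):
    """Warp both images so that conjugate epipolar lines fall on the same image row."""
    size = (img_l.shape[1], img_l.shape[0])
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, dist1, K2, dist2, size, R, T)
    map_lx, map_ly = cv2.initUndistortRectifyMap(K1, dist1, R1, P1, size, cv2.CV_32FC1)
    map_rx, map_ry = cv2.initUndistortRectifyMap(K2, dist2, R2, P2, size, cv2.CV_32FC1)
    rect_l = cv2.remap(img_l, map_lx, map_ly, cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_r, map_rx, map_ry, cv2.INTER_LINEAR)
    return rect_l, rect_r, Q   # Q is the disparity-to-depth matrix used in Step 5
```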
Step 4: Stereo matching: a stereo matching algorithm computes the disparity values between the rectified images and the original images; these disparity values yield a sparse matching between each pair of images, and an optimization algorithm then produces a dense matching;
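The original text does not name a particular matching or optimization algorithm for Step 4; one common choice that produces a dense disparity map directly is semi-global block matching, sketched here with assumed parameters:

```python
import cv2

def dense_disparity(rect_l, rect_r):
    """Compute a dense disparity map from the rectified grayscale image pair."""
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
    # OpenCV returns fixed-point disparities scaled by 16, hence the division.
    return sgbm.compute(rect_l, rect_r).astype("float32") / 16.0
```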
Step 5: Three-dimensional reconstruction: the depth values of the stereo-matched images are computed by the principle of triangulation to obtain a dense three-dimensional point cloud; gridding and interpolation of this point cloud then yield the three-dimensional structure model of the object; the three-dimensional structure model of the object is
[The three-dimensional structure model is given by a formula reproduced only as an image (PCTCN2019108333-appb-000001) in the original document.]
where r1, r2, …, r9 are the rotation components, tx, ty and tz are the translation components, and Xw, Yw and Zw are the solved three-dimensional coordinates of the three-dimensional structure model;
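For Step 5, the depth of every matched pixel follows from triangulation of the rectified geometry; a sketch using the disparity-to-depth matrix Q from the rectification step is given below. This uses the standard reprojection relation rather than the patent's structure-model formula, which is available only as an image in this text:

```python
import cv2

def point_cloud(disp, Q):
    """Triangulate all pixels with a valid disparity into a dense 3D point cloud (Xw, Yw, Zw)."""
    pts = cv2.reprojectImageTo3D(disp, Q)   # (H, W, 3) array of 3D coordinates
    mask = disp > disp.min()                # keep only pixels where matching succeeded
    return pts[mask].reshape(-1, 3)
```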
Step 6: Discretization: the target object acquired by the binocular vision system is discretized in space, i.e. the target is converted into a set of points;
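Step 6 turns the reconstructed target into a finite set of points; a simple sketch is voxel-grid discretization of the point cloud, with the cell size chosen here purely as an assumption:

```python
import numpy as np

def discretize(points, voxel=5.0):
    """Snap 3D points to a regular grid and keep one representative point per occupied cell."""
    cells = np.floor(points / voxel).astype(int)
    _, idx = np.unique(cells, axis=0, return_index=True)
    return points[np.sort(idx)]
```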
Step 7: Select the initial point;
Step 8: Select the optimal trajectory strategy: the decision policy π is defined as a function that maps a state to an action, s→a, i.e. action a is executed in the current state s; the objective function of reinforcement learning is:
Q^π(s, a) = E[R(s_0) + γR(s_1) + γ²R(s_2) + … | s_0 = s, a_0 = a, π]    (1)
Q^π(s, a) denotes the expected return obtained by taking action a in the current initial state s and then following the decisions of policy π; the goal is to find the optimal policy π* such that π* = arg max Q^π(s, a). On the basis of Q-learning, deep Q-learning fits the Q function Q(s, a) with a neural network whose input state s can be a continuous variable; the learning algorithm is as follows:
(1) Initialize the experience replay buffer D;
(2) Initialize the Q network and the target Q network Q′ with random weights;
(3) Start a new round: randomly generate the environment and the spraying target, with the target point appearing at a random position within a certain range of the space, and reset the spray gun to its initial posture;
(4) With probability ε select a random action a_t, and with probability 1−ε select the optimal action a_t = arg max_a Q(s_t, a), where s_t is the current state;
(5) Execute action a_t, obtain the current reward r_t and the next state s_{t+1}, and store (s_t, a_t, r_t, s_{t+1}) in D;
(6) Randomly sample a batch of data from D, i.e. a batch of (s_j, a_j, r_j, s_{j+1}) quadruples, and let [the formula reproduced only as an image (PCTCN2019108333-appb-000002) in the original document], where s_T is the target position reached by the spray gun;
(7) If s_{t+1} is not the final state s_T, return to step (4); if s_{t+1} is the final state s_T, update the target network Q′ so that the network parameters of Q′ equal those of Q, and return to step (3).
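The learning algorithm of steps (1)-(7) is a deep Q-learning loop with a target network and experience replay; a compact sketch is given below. The network sizes, hyperparameters and the environment interface (reset/step returning state, reward and a done flag) are assumptions, and the target value in step (6) is written in the standard DQN form because the patent's own formula is reproduced only as an image:

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small MLP that fits Q(s, a); layer sizes are illustrative."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))
    def forward(self, s):
        return self.net(s)

def train(env, state_dim, n_actions, episodes=500, gamma=0.9, eps=0.1, batch=64):
    q, q_target = QNet(state_dim, n_actions), QNet(state_dim, n_actions)
    q_target.load_state_dict(q.state_dict())            # step (2): copy random weights into Q'
    opt = torch.optim.Adam(q.parameters(), lr=1e-3)
    replay = deque(maxlen=10000)                         # step (1): experience replay store D
    for _ in range(episodes):                            # step (3): new round, random target, gun reset
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:                    # step (4): epsilon-greedy action selection
                a = random.randrange(n_actions)
            else:
                a = int(q(torch.as_tensor(s, dtype=torch.float32)).argmax())
            s2, r, done = env.step(a)                    # step (5): reward r_t and next state s_{t+1}
            replay.append((s, a, r, s2, done))
            s = s2
            if len(replay) >= batch:                     # step (6): sample (s_j, a_j, r_j, s_{j+1})
                sj, aj, rj, sj1, dj = map(np.array, zip(*random.sample(replay, batch)))
                with torch.no_grad():
                    q_next = q_target(torch.as_tensor(sj1, dtype=torch.float32)).max(1).values
                # Assumed standard DQN target: y = r if the final state is reached,
                # otherwise y = r + gamma * max_a Q'(s', a).
                y = torch.as_tensor(rj, dtype=torch.float32) \
                    + gamma * q_next * (1.0 - torch.as_tensor(dj, dtype=torch.float32))
                pred = q(torch.as_tensor(sj, dtype=torch.float32)) \
                    .gather(1, torch.as_tensor(aj, dtype=torch.int64).unsqueeze(1)).squeeze(1)
                loss = nn.functional.mse_loss(pred, y)
                opt.zero_grad(); loss.backward(); opt.step()
        q_target.load_state_dict(q.state_dict())         # step (7): sync Q' with Q, back to step (3)
    return q
```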
Here the reward function r_t in step (5) consists of three parts: the first part penalizes repeated spraying, i.e. remaining stationary; the second part rewards the spray gun for reaching the target position; and the third part rewards or penalizes according to the distance between the spray gun and the target; the reward function is:
[The reward function is given by a formula reproduced only as an image (PCTCN2019108333-appb-000003) in the original document.]
where x_d is the coordinate of the current point and x_e is the coordinate of the target point.
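A sketch of the three-part reward described above; the actual constants appear only in the patent's formula image, so the values below are assumptions:

```python
import numpy as np

R_REPEAT = -1.0   # assumed penalty for spraying the same point again (staying still)
R_GOAL   = 10.0   # assumed reward for the spray gun reaching the target position
K_DIST   = 0.1    # assumed weight of the distance-based term

def reward(x_d, x_e, moved, tol=1e-3):
    """Three-part reward: penalize repeats, reward reaching the target,
    and reward/penalize by the distance between the gun position x_d and the target x_e."""
    dist = np.linalg.norm(np.asarray(x_d, float) - np.asarray(x_e, float))
    if not moved:          # part 1: repeated spraying / remaining stationary
        return R_REPEAT
    if dist < tol:         # part 2: target position reached (assumed tolerance)
        return R_GOAL
    return -K_DIST * dist  # part 3: shaping by distance to the target
```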
The foregoing has shown and described the basic principles, main features and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited by the above embodiments; the above embodiments and the description merely illustrate the principles of the present invention, and various changes and improvements may be made without departing from the spirit and scope of the present invention, all of which fall within the scope of the claimed invention. The scope of protection claimed by the present invention is defined by the appended claims and their equivalents.

Claims (5)

  1. A reinforcement-learning-based trajectory optimization method for a spraying robot, characterized in that the trajectory optimization method determines the optimal spraying trajectory through image acquisition, image processing, three-dimensional reconstruction, discretization, selection of an initial point and selection of the optimal trajectory strategy, wherein the image processing includes camera calibration, image correction and stereo matching; the method comprises:
    Step 1: Image acquisition: two identical cameras of the binocular vision system simultaneously capture images of the target object in the same scene, and the acquired images are then preprocessed;
    Step 2: Camera calibration: using the obtained distortion vector, remove the lens distortion in the radial and tangential directions to obtain undistorted images, and at the same time establish the imaging model of the cameras to determine the correspondence between target points and pixel points;
    Step 3: Image correction: through a linear transformation, the conjugate epipolar lines of the processed undistorted images are brought onto the same horizontal line, reducing the matching problem in two-dimensional space to a search problem in one-dimensional space;
    Step 4: Stereo matching: a stereo matching algorithm computes the disparity values between the rectified images and the original images; these disparity values yield a sparse matching between each pair of images, and an optimization algorithm then produces a dense matching;
    Step 5: Three-dimensional reconstruction: the depth values of the stereo-matched images are computed by the principle of triangulation to obtain a dense three-dimensional point cloud; gridding and interpolation of this point cloud then yield the three-dimensional structure model of the object;
    Step 6: Discretization: the target object acquired by the binocular vision system is discretized in space, i.e. the target is converted into a set of points;
    Step 7: Select the initial point;
    Step 8: Select the optimal trajectory strategy: the decision policy π is defined as a function that maps a state to an action, s→a, i.e. action a is executed in the current state s; the objective function of reinforcement learning is:
    Q^π(s, a) = E[R(s_0) + γR(s_1) + γ²R(s_2) + … | s_0 = s, a_0 = a, π]    (1)
    where Q^π(s, a) denotes the expected return obtained by taking action a in the current initial state s and then following the decisions of policy π; the goal is to find the optimal policy π* such that π* = arg max Q^π(s, a).
  2. The reinforcement-learning-based trajectory optimization method for a spraying robot according to claim 1, characterized in that in the camera calibration of step 2 it is assumed that the three-dimensional coordinates of the target point P are (Xw, Yw, Zw); the coordinate system of the left camera is O1-X1Y1Z1 and its image coordinate system is o1-x1y1; the coordinate system of the right camera is O2-X2Y2Z2 and its image coordinate system is o2-x2y2; the origins O1 and O2 are the optical centers of the left and right cameras respectively; and the focal lengths of the left and right cameras are f1 and f2 respectively.
  3. The reinforcement-learning-based trajectory optimization method for a spraying robot according to claim 2, characterized in that the three-dimensional structure model of the object in step 5 is obtained by the following formula:
    [formula reproduced only as an image (PCTCN2019108333-appb-100001) in the original document]
    where r1, r2, …, r9 are the rotation components, tx, ty and tz are the translation components, and Xw, Yw and Zw are the solved three-dimensional coordinates of the three-dimensional structure model.
  4. The reinforcement-learning-based trajectory optimization method for a spraying robot according to claim 1, characterized in that in step 8, on the basis of Q-learning, deep Q-learning fits the Q function Q(s, a) with a neural network whose input state s can be a continuous variable;
    the learning algorithm is as follows:
    (1) Initialize the experience replay buffer D;
    (2) Initialize the Q network and the target Q network Q′ with random weights;
    (3) Start a new round: randomly generate the environment and the spraying target, with the target point appearing at a random position within a certain range of the space, and reset the spray gun to its initial posture;
    (4) With probability ε select a random action a_t, and with probability 1−ε select the optimal action a_t = arg max_a Q(s_t, a), where s_t is the current state;
    (5) Execute action a_t, obtain the current reward r_t and the next state s_{t+1}, and store (s_t, a_t, r_t, s_{t+1}) in D;
    (6) Randomly sample a batch of data from D, i.e. a batch of (s_j, a_j, r_j, s_{j+1}) quadruples, and let [the formula reproduced only as an image (PCTCN2019108333-appb-100002) in the original document], where s_T is the target position reached by the spray gun;
    (7) If s_{t+1} is not the final state s_T, return to step (4); if s_{t+1} is the final state s_T, update the target network Q′ so that the network parameters of Q′ equal those of Q, and return to step (3).
  5. The reinforcement-learning-based trajectory optimization method for a spraying robot according to claim 4, characterized in that the reward function r_t consists of three parts: the first part penalizes repeated spraying, i.e. remaining stationary; the second part rewards the spray gun for reaching the target position; and the third part rewards or penalizes according to the distance between the spray gun and the target; the reward function is:
    [formula reproduced only as an image (PCTCN2019108333-appb-100003) in the original document]
    where x_d is the coordinate of the current point and x_e is the coordinate of the target point.
PCT/CN2019/108333 2018-12-27 2019-09-27 Method employing reinforcement learning to optimize trajectory of spray painting robot WO2020134254A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811608181.0A CN109521774B (en) 2018-12-27 2018-12-27 Spraying robot track optimization method based on reinforcement learning
CN201811608181.0 2018-12-27

Publications (1)

Publication Number Publication Date
WO2020134254A1 true WO2020134254A1 (en) 2020-07-02

Family

ID=65796865

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/108333 WO2020134254A1 (en) 2018-12-27 2019-09-27 Method employing reinforcement learning to optimize trajectory of spray painting robot

Country Status (2)

Country Link
CN (1) CN109521774B (en)
WO (1) WO2020134254A1 (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111915632A (en) * 2020-07-26 2020-11-10 同济大学 Poor texture target object truth value database construction method based on machine learning
CN112802091A (en) * 2021-01-28 2021-05-14 北京理工大学 DQN-based intelligent confrontation behavior realization method under augmented reality condition
CN112809678A (en) * 2021-01-15 2021-05-18 合肥工业大学 Cooperative control method for production line system of multi-robot workstation
CN113379751A (en) * 2021-06-23 2021-09-10 苏州紫金港智能制造装备有限公司 Two-dimensional target detection-based most suitable workpiece capturing point cloud identification and segmentation method, device and system
CN114022414A (en) * 2021-10-15 2022-02-08 北方工业大学 Execution method of intelligent refueling and power-up action based on binocular perception learning
CN114299104A (en) * 2021-12-23 2022-04-08 中铭谷智能机器人(广东)有限公司 Automobile spraying track generation method based on multiple 3D (three-dimensional) vision
CN114359388A (en) * 2022-01-06 2022-04-15 闽都创新实验室 Binocular vision SLAM dense image construction method based on DNN stereo matching module
CN114434441A (en) * 2021-12-31 2022-05-06 中南大学 Mobile robot visual servo tracking control method based on self-adaptive dynamic programming
CN114535738A (en) * 2022-02-23 2022-05-27 大连爱智控制***有限公司 Automatic soldering machine teaching programming method based on model and numerical control drilling file
CN115047875A (en) * 2022-06-07 2022-09-13 中国北方车辆研究所 Task-driven coarse-grained iterative model of quadruped robot
CN115222808A (en) * 2021-06-30 2022-10-21 达闼机器人股份有限公司 Positioning method and device based on unmanned aerial vehicle, storage medium and electronic equipment
CN115283172A (en) * 2022-07-22 2022-11-04 颖态智能技术(上海)有限公司 Robot automatic spraying method based on point cloud processing
CN115319759A (en) * 2022-09-21 2022-11-11 上海摩马智能科技有限公司 Intelligent planning algorithm for tail end control track of mechanical arm
CN115829963A (en) * 2022-11-28 2023-03-21 江苏凯威特斯半导体科技有限公司 Plasma spraying image data identification and analysis method and system
CN116834037A (en) * 2023-09-01 2023-10-03 广东技术师范大学 Dynamic multi-objective optimization-based picking mechanical arm track planning method and device
CN116861175A (en) * 2023-06-16 2023-10-10 云浮市赢石网络科技有限公司 Operation track correction method based on neural network
CN117250868A (en) * 2023-11-14 2023-12-19 泉州装备制造研究所 Shunt control method, system and storage medium based on multi-target particle swarm algorithm
CN117694841A (en) * 2024-02-06 2024-03-15 成都中医药大学 Gravity center track data acquisition method

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109521774B (en) * 2018-12-27 2023-04-07 南京芊玥机器人科技有限公司 Spraying robot track optimization method based on reinforcement learning
CN109976347B (en) * 2019-04-11 2023-10-13 中南大学 Visual servo path planning method based on rapid expansion random tree and potential field method
CN110176032B (en) * 2019-04-28 2021-02-26 暗物智能科技(广州)有限公司 Three-dimensional reconstruction method and device
CN111352419B (en) * 2020-02-25 2021-06-04 山东大学 Path planning method and system for updating experience playback cache based on time sequence difference
CN111415389B (en) * 2020-03-18 2023-08-29 清华大学 Label-free six-dimensional object posture prediction method and device based on reinforcement learning
CN111413974B (en) * 2020-03-30 2021-03-30 清华大学 Automobile automatic driving motion planning method and system based on learning sampling type
CN111602517B (en) * 2020-05-28 2021-09-24 华南农业大学 Distributed visual active perception method for string-type fruits and application of distributed visual active perception method
CN111906765B (en) * 2020-07-31 2022-07-12 平安科技(深圳)有限公司 Space sampling method, device, equipment and medium applied to path planning
CN113232016A (en) * 2021-04-13 2021-08-10 哈尔滨工业大学(威海) Mechanical arm path planning method integrating reinforcement learning and fuzzy obstacle avoidance
CN113232019A (en) * 2021-05-13 2021-08-10 中国联合网络通信集团有限公司 Mechanical arm control method and device, electronic equipment and storage medium
CN113393495B (en) * 2021-06-21 2022-02-01 暨南大学 High-altitude parabolic track identification method based on reinforcement learning
CN114011608B (en) * 2021-11-26 2023-04-18 航天科工深圳(集团)有限公司 Spraying process optimization system based on digital twinning and spraying optimization method thereof
CN115430588A (en) * 2022-08-13 2022-12-06 粉饰科技(广东)有限公司 Low-temperature powder spraying method based on gas catalytic infrared technology
CN116852374B (en) * 2023-08-08 2024-04-26 深圳创劲鑫科技有限公司 Intelligent robot control system based on machine vision

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102981406A (en) * 2012-11-26 2013-03-20 浙江工业大学 Sole glue spraying thickness control method based on binocular vision
CN104759379A (en) * 2015-04-15 2015-07-08 重庆大学 Intelligent full-process closed-loop spray painting robot based on spray painting target three-dimensional imaging technology
CN106094516A (en) * 2016-06-08 2016-11-09 南京大学 A kind of robot self-adapting grasping method based on deeply study
CN106228605A (en) * 2016-07-29 2016-12-14 东南大学 A kind of Stereo matching three-dimensional rebuilding method based on dynamic programming
CN106327561A (en) * 2016-08-31 2017-01-11 武汉湾流科技股份有限公司 Intelligent spraying method and system based on machine vision technology
US20170106542A1 (en) * 2015-10-16 2017-04-20 Amit Wolf Robot and method of controlling thereof
CN107908152A (en) * 2017-12-26 2018-04-13 苏州瀚华智造智能技术有限公司 A kind of movable robot automatic spray apparatus, control system and method
CN109521774A (en) * 2018-12-27 2019-03-26 南京芊玥机器人科技有限公司 A kind of spray robot track optimizing method based on intensified learning

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040138780A1 (en) * 2002-11-15 2004-07-15 Lewis Murray Anthony Certain principles of biomorphic robots
CN106423656B (en) * 2016-08-11 2019-04-02 重庆大学 Automatic spraying system and method based on cloud and images match
CN107450555A (en) * 2017-08-30 2017-12-08 唐开强 A kind of Hexapod Robot real-time gait planing method based on deeply study
CN107450593B (en) * 2017-08-30 2020-06-12 清华大学 Unmanned aerial vehicle autonomous navigation method and system
CN107756408B (en) * 2017-11-22 2020-10-23 浙江优迈德智能装备有限公司 Robot track teaching device and method based on active infrared binocular vision
CN108335350A (en) * 2018-02-06 2018-07-27 聊城大学 The three-dimensional rebuilding method of binocular stereo vision
CN108256504A (en) * 2018-02-11 2018-07-06 苏州笛卡测试技术有限公司 A kind of Three-Dimensional Dynamic gesture identification method based on deep learning
CN109048926A (en) * 2018-10-24 2018-12-21 河北工业大学 A kind of intelligent robot obstacle avoidance system and method based on stereoscopic vision

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102981406A (en) * 2012-11-26 2013-03-20 浙江工业大学 Sole glue spraying thickness control method based on binocular vision
CN104759379A (en) * 2015-04-15 2015-07-08 重庆大学 Intelligent full-process closed-loop spray painting robot based on spray painting target three-dimensional imaging technology
US20170106542A1 (en) * 2015-10-16 2017-04-20 Amit Wolf Robot and method of controlling thereof
CN106094516A (en) * 2016-06-08 2016-11-09 南京大学 A kind of robot self-adapting grasping method based on deeply study
CN106228605A (en) * 2016-07-29 2016-12-14 东南大学 A kind of Stereo matching three-dimensional rebuilding method based on dynamic programming
CN106327561A (en) * 2016-08-31 2017-01-11 武汉湾流科技股份有限公司 Intelligent spraying method and system based on machine vision technology
CN107908152A (en) * 2017-12-26 2018-04-13 苏州瀚华智造智能技术有限公司 A kind of movable robot automatic spray apparatus, control system and method
CN109521774A (en) * 2018-12-27 2019-03-26 南京芊玥机器人科技有限公司 A kind of spray robot track optimizing method based on intensified learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CUI, XIANXIAN ET AL: "Application of Binocular Vision Fast Matching on Painting Robot", PUBLIC COMMUNICATION OF SCIENCE & TECHNOLOGY, no. 8, 30 April 2015 (2015-04-30), pages 129 and 139, XP009521734, ISSN: 1674-6708 *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111915632A (en) * 2020-07-26 2020-11-10 同济大学 Poor texture target object truth value database construction method based on machine learning
CN111915632B (en) * 2020-07-26 2024-04-26 同济大学 Machine learning-based method for constructing truth database of lean texture target object
CN112809678A (en) * 2021-01-15 2021-05-18 合肥工业大学 Cooperative control method for production line system of multi-robot workstation
CN112802091B (en) * 2021-01-28 2023-08-29 北京理工大学 DQN-based agent countermeasure behavior realization method under augmented reality condition
CN112802091A (en) * 2021-01-28 2021-05-14 北京理工大学 DQN-based intelligent confrontation behavior realization method under augmented reality condition
CN113379751A (en) * 2021-06-23 2021-09-10 苏州紫金港智能制造装备有限公司 Two-dimensional target detection-based most suitable workpiece capturing point cloud identification and segmentation method, device and system
CN115222808A (en) * 2021-06-30 2022-10-21 达闼机器人股份有限公司 Positioning method and device based on unmanned aerial vehicle, storage medium and electronic equipment
CN115222808B (en) * 2021-06-30 2023-10-20 达闼机器人股份有限公司 Positioning method and device based on unmanned aerial vehicle, storage medium and electronic equipment
CN114022414A (en) * 2021-10-15 2022-02-08 北方工业大学 Execution method of intelligent refueling and power-up action based on binocular perception learning
CN114022414B (en) * 2021-10-15 2024-03-15 北方工业大学 Execution method of oiling and powering-up intelligent action based on binocular perception learning
CN114299104A (en) * 2021-12-23 2022-04-08 中铭谷智能机器人(广东)有限公司 Automobile spraying track generation method based on multiple 3D (three-dimensional) vision
CN114299104B (en) * 2021-12-23 2024-05-31 中铭谷智能机器人(广东)有限公司 Automobile spraying track generation method based on multiple 3D (three-dimensional) vision
CN114434441A (en) * 2021-12-31 2022-05-06 中南大学 Mobile robot visual servo tracking control method based on self-adaptive dynamic programming
CN114359388A (en) * 2022-01-06 2022-04-15 闽都创新实验室 Binocular vision SLAM dense image construction method based on DNN stereo matching module
CN114535738B (en) * 2022-02-23 2023-11-07 大连爱智控制***有限公司 Model and numerical control drilling file-based teaching programming method for automatic soldering machine
CN114535738A (en) * 2022-02-23 2022-05-27 大连爱智控制***有限公司 Automatic soldering machine teaching programming method based on model and numerical control drilling file
CN115047875B (en) * 2022-06-07 2024-05-14 中国北方车辆研究所 Task-driven four-foot robot coarse granularity iteration model
CN115047875A (en) * 2022-06-07 2022-09-13 中国北方车辆研究所 Task-driven coarse-grained iterative model of quadruped robot
CN115283172A (en) * 2022-07-22 2022-11-04 颖态智能技术(上海)有限公司 Robot automatic spraying method based on point cloud processing
CN115283172B (en) * 2022-07-22 2024-05-17 颖态智能技术(上海)有限公司 Robot automatic spraying method based on point cloud processing
CN115319759A (en) * 2022-09-21 2022-11-11 上海摩马智能科技有限公司 Intelligent planning algorithm for tail end control track of mechanical arm
CN115829963B (en) * 2022-11-28 2023-11-03 江苏凯威特斯半导体科技有限公司 Plasma spraying image data identification analysis method and system
CN115829963A (en) * 2022-11-28 2023-03-21 江苏凯威特斯半导体科技有限公司 Plasma spraying image data identification and analysis method and system
CN116861175A (en) * 2023-06-16 2023-10-10 云浮市赢石网络科技有限公司 Operation track correction method based on neural network
CN116834037B (en) * 2023-09-01 2023-10-31 广东技术师范大学 Dynamic multi-objective optimization-based picking mechanical arm track planning method and device
CN116834037A (en) * 2023-09-01 2023-10-03 广东技术师范大学 Dynamic multi-objective optimization-based picking mechanical arm track planning method and device
CN117250868A (en) * 2023-11-14 2023-12-19 泉州装备制造研究所 Shunt control method, system and storage medium based on multi-target particle swarm algorithm
CN117250868B (en) * 2023-11-14 2024-03-01 泉州装备制造研究所 Shunt control method, system and storage medium based on multi-target particle swarm algorithm
CN117694841A (en) * 2024-02-06 2024-03-15 成都中医药大学 Gravity center track data acquisition method
CN117694841B (en) * 2024-02-06 2024-04-30 成都中医药大学 Gravity center track data acquisition method

Also Published As

Publication number Publication date
CN109521774A (en) 2019-03-26
CN109521774B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
WO2020134254A1 (en) Method employing reinforcement learning to optimize trajectory of spray painting robot
CN111076733B (en) Robot indoor map building method and system based on vision and laser slam
CN112859859B (en) Dynamic grid map updating method based on three-dimensional obstacle object pixel object mapping
CN107945220B (en) Binocular vision-based reconstruction method
CN106845515B (en) Robot target identification and pose reconstruction method based on virtual sample deep learning
CN108520554B (en) Binocular three-dimensional dense mapping method based on ORB-SLAM2
CN112505065B (en) Method for detecting surface defects of large part by indoor unmanned aerial vehicle
CN109544636A (en) A kind of quick monocular vision odometer navigation locating method of fusion feature point method and direct method
CN113276106B (en) Climbing robot space positioning method and space positioning system
CN109509230A (en) A kind of SLAM method applied to more camera lens combined type panorama cameras
CN106826833A (en) Independent navigation robot system based on 3D solid cognition technologies
CN106934827A (en) The method for reconstructing and device of three-dimensional scenic
CN111260649B (en) Close-range mechanical arm sensing and calibrating method
CN105243637A (en) Panorama image stitching method based on three-dimensional laser point cloud
CN108564600B (en) Moving object posture tracking method and device
CN108519102A (en) A kind of binocular vision speedometer calculation method based on reprojection
CN114758011B (en) Zoom camera online calibration method fusing offline calibration results
CN109900272B (en) Visual positioning and mapping method and device and electronic equipment
CN114299039A (en) Robot and collision detection device and method thereof
CN111899303B (en) Novel feature matching and relative positioning method considering space inverse projection constraint
CN107123135B (en) A kind of undistorted imaging method of unordered three-dimensional point cloud
Buck et al. Capturing uncertainty in monocular depth estimation: Towards fuzzy voxel maps
Yang et al. A review of visual odometry in SLAM techniques
CN110782506B (en) Method for constructing grid map by fusing infrared camera and depth camera
Zhang et al. Object depth measurement from monocular images based on feature segments

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19905194

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19905194

Country of ref document: EP

Kind code of ref document: A1