WO2024031831A1 - Deep reinforcement learning-based robotic arm packing and unpacking collaboration method and system - Google Patents

Deep reinforcement learning-based robotic arm packing and unpacking collaboration method and system

Info

Publication number
WO2024031831A1
WO2024031831A1 PCT/CN2022/125985 CN2022125985W WO2024031831A1 WO 2024031831 A1 WO2024031831 A1 WO 2024031831A1 CN 2022125985 W CN2022125985 W CN 2022125985W WO 2024031831 A1 WO2024031831 A1 WO 2024031831A1
Authority
WO
WIPO (PCT)
Prior art keywords
packing
unpacking
action
Prior art date
Application number
PCT/CN2022/125985
Other languages
English (en)
French (fr)
Inventor
张伟
宋然
宋帅
杨硕
褚石磊
李贻斌
Original Assignee
Shandong University (山东大学)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University (山东大学)
Publication of WO2024031831A1 publication Critical patent/WO2024031831A1/zh

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1656Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1661Programme controls characterised by programming, planning systems for manipulators characterised by task planning, object-oriented languages
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J18/00Arms
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1679Programme controls characterised by the tasks executed

Definitions

  • The invention relates to the field of logistics automation, and in particular to a deep reinforcement learning-based packing and unpacking collaboration method and system for a robotic arm.
  • Heuristic algorithms focus on packing a set of cuboid objects into a limited three-dimensional space (such as a shipping box) with the highest possible space utilization, a typical NP-hard combinatorial optimization problem. The online 3D bin packing problem (3D-BPP) assumes that only the information of the object currently to be packed is available, while all other objects are unknown; early work therefore concentrated on designing different heuristics, such as tabu search, the first-fit algorithm and extreme-point placement. A heuristic is essentially a distillation of the real packing experience of human workers and does not generalize well to every packing scenario.
  • the present invention provides a deep reinforcement learning-based packing and unpacking collaboration method and system for a robotic arm, which learns the synergy between packing and unpacking actions through a Packing and Unpacking Network (PUN).
  • the first aspect of the present invention provides a deep reinforcement learning-based packing and unpacking collaboration method for a robotic arm.
  • a deep reinforcement learning-based packing and unpacking collaboration method for a robotic arm includes the following steps: obtaining the product box height map and the object size map, and extracting features from them;
  • according to the extracted features, the probability distribution of performing the packing action at every position of the height map is obtained from the packing actor network, and the probability distribution of performing the unpacking action at every position of the height map is obtained from the unpacking actor network;
  • according to the extracted features, the packing predicted state value is obtained from the packing critic network, and the unpacking predicted state value is obtained from the unpacking critic network;
  • according to the product box height map and the object size map, the packing mask predictor obtains the packing mask from heuristic rules, and the unpacking mask predictor obtains the unpacking mask from heuristic rules;
  • the packing mask modulates the probability distribution of the packing action to obtain the packing position, and the unpacking mask modulates the probability distribution of the unpacking action to obtain the unpacking position; the action type whose predicted state value is larger (packing or unpacking) is taken as the final action type to perform the packing or unpacking action.
  • the height map is an L×W discrete grid, and the value of each cell represents the cumulative height of the objects placed at that position.
  • an object n to be packed, with size l_n × w_n × h_n, is represented by a three-channel size map of L×W×3, whose three channels are assigned the values l_n, w_n and h_n respectively.
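  • As an illustration of the state representation described above, the following minimal Python sketch builds the two inputs with NumPy; the 10×10×10 grid resolution, variable names and NumPy encoding are illustrative assumptions and are not fixed by the patent.

```python
import numpy as np

L, W, H = 10, 10, 10  # illustrative bin resolution (length, width, max height)

def make_size_map(l_n: int, w_n: int, h_n: int) -> np.ndarray:
    """Three-channel L x W x 3 map whose channels carry the object's
    length, width and height, as described above."""
    size_map = np.empty((L, W, 3), dtype=np.float32)
    size_map[..., 0] = l_n
    size_map[..., 1] = w_n
    size_map[..., 2] = h_n
    return size_map

def place_on_height_map(height_map: np.ndarray, l_n, w_n, h_n, x, y) -> np.ndarray:
    """Update the L x W height map after packing an object whose
    left-front corner is placed at cell (x, y)."""
    footprint = height_map[x:x + l_n, y:y + w_n]
    base = footprint.max()                  # the object rests on the highest point below it
    height_map = height_map.copy()
    height_map[x:x + l_n, y:y + w_n] = base + h_n
    return height_map

height_map = np.zeros((L, W), dtype=np.float32)          # empty product box
height_map = place_on_height_map(height_map, 3, 2, 4, x=0, y=0)
state = (height_map, make_size_map(2, 2, 2))             # observation at one time step
```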
  • each action is parameterized as a basic action executed at a position p, where p corresponds to a cell in the height-map representation of the product box.
  • the packing action places the object on the cell: the left-front corner of the object's placement position coincides with the coordinates of the cell; a 2D coordinate (x_n, y_n) is used to represent the position of the cell, and the action (packing, (x_n, y_n)) means placing object n at position (x_n, y_n) of the height map.
  • the unpacking action removes the object at cell p from the product box: the object lies in the product box and the coordinates of its left-front corner are the same as the cell coordinates; a temporary object buffer is set up to store at most B unpacked objects, and the action (unpacking, (x_m, y_m)) means removing the object m located at height-map position (x_m, y_m) and placing it in the temporary buffer.
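  • A minimal sketch of this action parameterization is shown below; the Python data structure is an illustrative assumption, not part of the patent.

```python
from dataclasses import dataclass
from typing import Literal, Tuple

@dataclass
class Action:
    """Basic action psi executed at a height-map cell p = (x, y)."""
    kind: Literal["packing", "unpacking"]
    position: Tuple[int, int]

# (packing, (x_n, y_n)): place object n with its left-front corner at (x_n, y_n).
pack = Action("packing", (2, 5))
# (unpacking, (x_m, y_m)): remove the object whose left-front corner sits at
# (x_m, y_m) and move it to the temporary buffer (capacity B).
unpack = Action("unpacking", (0, 3))
```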
  • for the object to be packed at time step t, the predicted basic action and its execution position are obtained; the object to be packed comes from the conveyor belt or from the object buffer;
  • if a packing action is predicted, the object is grasped from the conveyor belt (or the buffer) and placed at the target position p in the product box, and the next object to be packed is selected at time step t+1;
  • if an unpacking action is predicted, object m is picked up from the target position p in the product box and placed in the buffer for temporary storage; at the next time step, object n remains the current object to be packed and packing or unpacking actions keep being predicted for it until it is placed in the product box;
  • after object n has been placed, object m in the buffer is taken as the object to be packed at the next time step; if the buffer holds several objects, they are selected in the order in which they were unpacked into the buffer (first in, first out), and only after the buffer has been emptied can the next object to be packed be taken from the conveyor belt.
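  • The object-selection rule above amounts to a FIFO buffer in front of the conveyor belt; the sketch below illustrates it. The class, the capacity value of 12 and the exception are illustrative assumptions.

```python
from collections import deque

B = 12  # illustrative buffer capacity (upper bound on unpacked objects)

class ObjectSource:
    """Chooses the next object to pack: buffered (unpacked) objects first,
    in FIFO order, then the conveyor belt, as described above."""

    def __init__(self, conveyor):
        self.conveyor = deque(conveyor)   # objects arriving one by one
        self.buffer = deque()             # temporary storage for unpacked objects

    def next_object(self):
        # Buffered objects are taken in the order they were unpacked; only when
        # the buffer is empty does the conveyor provide a new object.
        if self.buffer:
            return self.buffer.popleft()
        return self.conveyor.popleft()

    def unpack(self, obj):
        if len(self.buffer) >= B:
            raise RuntimeError("buffer capacity B exceeded - the episode ends")
        self.buffer.append(obj)
```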
  • for the packing actor network and the unpacking actor network, the action constraints are applied as a re-normalized probability distribution π_p(a_t, s_t):

        π_p(a_t, s_t) = exp(l_i) / Σ_{j=1}^{N} exp(l_j)

  • where N is the number of valid actions, s_t is the environment state at time t, a_t is the action at time t, and l_i is the unnormalized score (logit) output by the packing actor network or the unpacking actor network; the logits of invalid actions are replaced by a very large negative value, so their probability after the softmax is almost zero.
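  • A minimal NumPy sketch of this masked, re-normalized distribution follows; the specific logit-replacement constant and array shapes are illustrative assumptions.

```python
import numpy as np

NEG_INF = -1.0e8  # large negative value used to suppress invalid logits

def masked_policy(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Re-normalized distribution pi(a_t, s_t): logits l_i of the N valid
    actions (mask == 1) go through a softmax; invalid actions get ~0 probability."""
    masked_logits = np.where(mask.astype(bool), logits, NEG_INF)
    masked_logits = masked_logits - masked_logits.max()   # numerical stability
    exp = np.exp(masked_logits)
    return exp / exp.sum()

logits = np.random.randn(10 * 10)          # actor-network scores, one per cell
mask = np.random.randint(0, 2, 10 * 10)     # 1 = action satisfies the heuristics
probs = masked_policy(logits, mask)
```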
  • the packing task reward is R_r(s_t, a, s_{t+1}) = r_v − α × r_w, where r_v is the volume utilization rate, r_w is the space waste rate, and α is the weight of the space waste rate;
  • the volume utilization rate r_v is the ratio of the sum of the volumes of all objects in the product box to the total volume of the product box;
  • the space waste rate r_w is the ratio of the wasted space in the product box to the total volume of the product box.
  • the second aspect of the present invention provides a deep reinforcement learning-based packing and unpacking collaboration system for a robotic arm.
  • a deep reinforcement learning-based packing and unpacking collaboration system for a robotic arm includes:
  • a data acquisition module configured to obtain the product box height map and the object size map;
  • a feature extraction module configured to extract features from the product box height map and the object size map;
  • a probability distribution generation module configured to obtain, from the extracted features, the probability distribution of performing the packing action at every position of the height map via the packing actor network, and the probability distribution of performing the unpacking action at every position of the height map via the unpacking actor network;
  • a state value generation module configured to obtain, from the extracted features, the packing predicted state value via the packing critic network and the unpacking predicted state value via the unpacking critic network;
  • a mask generation module configured to obtain, from the product box height map and the object size map, the packing mask via the packing mask predictor and the unpacking mask via the unpacking mask predictor, both based on heuristic rules;
  • an action collaboration module configured to modulate the probability distribution of the packing action with the packing mask to obtain the packing position, modulate the probability distribution of the unpacking action with the unpacking mask to obtain the unpacking position, and take the action type with the larger of the packing and unpacking predicted state values as the final action type to perform the packing or unpacking action.
  • the third aspect of the present invention provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the steps of the deep reinforcement learning-based robotic arm packing and unpacking collaboration method described in the first aspect are implemented.
  • the fourth aspect of the present invention provides an electronic device, including a memory, a processor, and a program stored in the memory and executable on the processor; when the processor executes the program, the steps of the method described in the first aspect are implemented.
  • the deep reinforcement learning-based robotic arm packing and unpacking collaboration method and system of the present invention introduce an unpacking mechanism into the packing problem: inappropriately placed objects are removed and re-placed in the product box, which greatly improves the final space utilization of the product box.
  • the method and system propose a deep reinforcement learning-based Packing and Unpacking Network (PUN); PUN is a two-branch architecture that produces the state values corresponding to the packing and unpacking actions and the execution position of each action, and decides whether to perform the packing or the unpacking action according to the state values; the whole collaboration between packing and unpacking actions is learned within a single DRL framework, which improves the accuracy and efficiency of collaborative control.
  • the method and system further design a packing heuristic and an unpacking heuristic and incorporate them into the DRL framework to guide network learning, which further improves the accuracy and efficiency of collaborative control.
  • Figure 1 is a schematic diagram of the collaboration between packing and unpacking actions provided in Embodiment 1 of the present invention.
  • Figure 2 is a schematic structural diagram of the PUN network provided in Embodiment 1 of the present invention.
  • Figure 3 is a schematic diagram of an example of the packing and unpacking collaboration process provided in Embodiment 1 of the present invention.
  • Figure 4 is a schematic diagram of an example of wasted space provided in Embodiment 1 of the present invention.
  • Figure 5 is a schematic diagram of the robot packing system provided in Embodiment 1 of the present invention.
  • Figure 6 is a visualization of the results of different packing algorithms provided in Embodiment 1 of the present invention.
  • Figure 7 shows the learning curves of the PUN variants provided in Embodiment 1 of the present invention.
  • Figure 8 shows the results of the real-world packing demonstration provided in Embodiment 1 of the present invention.
  • Embodiment 1 of the present invention provides a deep reinforcement learning-based packing and unpacking collaboration method for a robotic arm, which includes the following processes:
  • this embodiment proposes a deep reinforcement learning-based packing and unpacking collaboration network, named PUN; PUN is a two-branch architecture that produces the state values corresponding to the packing and unpacking actions and the position at which each action is performed, and decides whether to perform the packing or the unpacking action according to the state values;
  • this embodiment also designs a packing heuristic and an unpacking heuristic and incorporates them into the DRL framework to guide network learning.
  • the packing task is formulated as a Markov decision process (MDP) with state S, action A, transition probability function P and reward R; an end-to-end DRL framework is used to solve this MDP, which seeks a policy π(a_t | s_t; θ_π) that maximizes the expected sum of cumulative rewards.
  • state representation: the state observations of the product box and of the object to be packed are modeled to represent the state s of the environment; for the product box, its state observation at time t is discretized into a height map H_t with values in [0, H] as the state representation of the product box; the height map is an L×W discrete grid, and the value of each cell represents the cumulative height of the objects placed at that position; an object n to be packed, of size l_n × w_n × h_n, is represented by a three-channel L×W×3 size map whose three channels are assigned the values l_n, w_n and h_n respectively.
  • action definition: two types of actions are defined, the packing action and the unpacking action; each action a_t is parameterized as a basic action (packing or unpacking) executed at a position p, where p corresponds to a cell c in the height-map representation of the product box: a_t = (ψ, p), ψ ∈ {packing, unpacking}, p → c ∈ H_t.
  • (1) packing: place the object on cell c, with the left-front corner of the object's placement position coinciding with the coordinates of cell c; this embodiment uses a 2D coordinate (x_n, y_n) to represent the position of cell c, and the action (packing, (x_n, y_n)) means placing object n at position (x_n, y_n) of the height map.
  • (2) unpacking: remove the object from cell c, where the object lies in the product box and the coordinates of its left-front corner are the same as those of cell c; a temporary object buffer is set up to store at most B unpacked objects, and the action (unpacking, (x_m, y_m)) means removing the object located at height-map position (x_m, y_m) and placing it in the buffer.
  • PUN is a two-branch network structure; it takes the product box height map and the object size map as input and outputs the type of the basic action ψ and the position p at which it is executed on the height map; the state input is encoded into features by a shared feature extraction module and then fed to the packing branch and the unpacking branch respectively.
  • the packing branch and the unpacking branch each consist of three parts: an actor network, a critic network and a heuristic mask predictor; the actor and critic networks of the two branches form the packing main network and the unpacking main network respectively.
  • each actor network outputs the probability distribution of executing its basic action ψ at every position of the height map; the two critic networks predict the state values V_p(s_t) and V_u(s_t), which measure the cumulative reward expected from taking the two basic actions in state s_t; the heuristic mask predictors generate the packing mask M_p and the unpacking mask M_u from heuristic rules, predicting the valid packing and unpacking positions respectively, and the probability distributions output by the actor networks are modulated by these masks.
  • for the current state s_t, PUN outputs two state values (V_p(s_t), V_u(s_t)) and two probability distributions (π_p(a_t, s_t), π_u(a_t, s_t)); the final action a_t consists of the basic action ψ and the execution position p.
  • the basic action corresponding to the higher state value is chosen as the final basic action ψ, meaning that the DRL agent is expected to receive the higher reward when executing it; the final execution position p is sampled from the probability distribution corresponding to the chosen basic action.
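  • The PyTorch sketch below illustrates this two-branch structure and action selection; the encoder layout, layer sizes and the masking constant are illustrative assumptions, since the patent does not specify the internal layers of PUN.

```python
import torch
import torch.nn as nn

class PUN(nn.Module):
    """Minimal two-branch sketch: a shared encoder, then an actor/critic pair
    for packing and an actor/critic pair for unpacking."""

    def __init__(self, grid: int = 10):
        super().__init__()
        self.grid = grid
        self.encoder = nn.Sequential(           # 1 height-map channel + 3 size-map channels
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.Flatten())
        feat = 64 * grid * grid
        self.pack_actor = nn.Linear(feat, grid * grid)    # packing logits per cell
        self.unpack_actor = nn.Linear(feat, grid * grid)  # unpacking logits per cell
        self.pack_critic = nn.Linear(feat, 1)             # V_p(s_t)
        self.unpack_critic = nn.Linear(feat, 1)           # V_u(s_t)

    def forward(self, height_map, size_map):
        x = torch.cat([height_map.unsqueeze(1), size_map], dim=1)
        f = self.encoder(x)
        return (self.pack_actor(f), self.unpack_actor(f),
                self.pack_critic(f), self.unpack_critic(f))

def select_action(pun, height_map, size_map, pack_mask, unpack_mask):
    """Pick the branch with the larger predicted state value, then sample the
    execution position from that branch's mask-modulated distribution."""
    lp, lu, vp, vu = pun(height_map, size_map)
    if vp.item() >= vu.item():
        kind, logits, mask = "packing", lp, pack_mask
    else:
        kind, logits, mask = "unpacking", lu, unpack_mask
    logits = logits.masked_fill(~mask.flatten(), -1e8)    # suppress invalid cells
    pos = torch.distributions.Categorical(logits=logits).sample().item()
    return kind, divmod(pos, pun.grid)

pun = PUN()
hm = torch.zeros(1, 10, 10)                    # empty box
sm = torch.ones(1, 3, 10, 10)                  # size map of a 1x1x1 object
mask = torch.ones(10, 10, dtype=torch.bool)    # all positions assumed valid here
kind, (x, y) = select_action(pun, hm, sm, mask, mask)
```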
  • for the object n to be packed at time step t, PUN predicts the basic action ψ and the position p to execute; object n can come from the conveyor belt or from the object buffer. If a packing action is predicted, the agent grasps object n from the conveyor belt (or the buffer), places it at the target position p in the product box, and then moves to time step t+1 to select the next object to be packed; if an unpacking action is predicted, the agent picks up object m from the target position p in the product box and places it in the buffer for temporary storage.
  • at the next time step t+1 the agent keeps object n as the current object to be packed and continues to predict packing or unpacking actions for it until it is placed in the product box.
  • after object n has been placed, object m in the buffer is taken as the object to be packed when predicting the action at the next time step.
  • if the buffer contains several objects, the agent selects the object to be packed in the order in which the objects were unpacked into the buffer (first in, first out); only after all objects in the buffer have been cleared can the agent select the next object to be packed from the conveyor belt.
  • Figure 3 shows an example of the packing and unpacking collaboration process; the objects to be packed in Figure 3 are the objects waiting to be packed at the current time step, and the numbers on the objects indicate their packing order.
  • heuristic-based action constraints: three types of heuristic rules are designed and introduced into the proposed network framework to guide the learning of the network.
  • the first is the physics heuristic, which defines actions that would make packing or unpacking fail as invalid actions, for example a packing action that places an object beyond the boundaries of the product box, or an unpacking action executed at a location where no object is placed; the second is the packing heuristic, which consists of four popular heuristic packing algorithms, namely extreme-point placement, remaining maximum space, first fit and layered placement; the last is the unpacking heuristic: since few heuristic algorithms exist for unpacking, this embodiment designs two unpacking rules, and for objects already placed in the product box, the positions of objects that lie on the top layer and whose volume is below a given threshold are defined as valid unpacking positions.
  • to incorporate these three heuristics into the DRL framework, two mask predictors are designed to generate binary feasibility masks M that indicate valid and invalid actions; only actions that satisfy the heuristic rules are valid. The feasibility masks modulate the outputs of the packing and unpacking branches: the actor networks output unnormalized scores l (logits), which a softmax layer turns into action probability distributions; for invalid actions the corresponding logit is replaced by a very large negative value, so that the softmax assigns them almost zero probability, and the action constraints are thus treated as the re-normalized probability distribution π_p(a_t, s_t) over the N valid actions given above.
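  • The sketch below illustrates how such feasibility masks could be built for the physics and unpacking heuristics; the bin height, volume threshold and bookkeeping of placed objects are illustrative assumptions, and the four packing heuristics listed above are not reproduced here.

```python
import numpy as np

H_MAX = 10            # bin height (illustrative)
VOL_THRESHOLD = 8     # illustrative volume threshold for unpackable objects

def packing_mask(height_map, l_n, w_n, h_n):
    """Physics heuristic for packing: the object must lie inside the bin
    footprint and must not exceed the bin height."""
    L, W = height_map.shape
    mask = np.zeros((L, W), dtype=bool)
    for x in range(L - l_n + 1):
        for y in range(W - w_n + 1):
            top = height_map[x:x + l_n, y:y + w_n].max() + h_n
            mask[x, y] = top <= H_MAX
    return mask

def unpacking_mask(height_map, placed):
    """Unpacking heuristic: only objects on the top layer whose volume is below
    a threshold may be removed. `placed` holds (x, y, l, w, h, top_z) records
    for objects already in the bin."""
    mask = np.zeros_like(height_map, dtype=bool)
    for x, y, l, w, h, top_z in placed:
        on_top = np.allclose(height_map[x:x + l, y:y + w], top_z)
        if on_top and l * w * h < VOL_THRESHOLD:
            mask[x, y] = True
    return mask
```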
  • reward function: a step-wise reward is designed for the packing task, consisting of two parts, the volume utilization rate r_v and the space waste rate r_w.
  • the volume utilization rate r_v is defined as the ratio of the sum of the volumes of all objects in the product box to the total volume of the product box.
  • the space waste rate r_w is defined as the ratio of the wasted space in the product box to the total volume of the product box, and measures the amount of space in which no object can be placed in the current state.
  • in Figure 4, the two dashed cuboids marked 1 and 2 are the wasted space in the current state; the reward setting of this embodiment encourages unpacking actions that reduce wasted space.
  • the complete reward function is defined as R_r(s_t, a, s_{t+1}) = r_v − α × r_w, where α is the weight parameter of the space waste rate r_w.
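  • A minimal sketch of this step reward follows; the α value of 0.5 and the helper signatures are illustrative assumptions (the patent does not fix α here), and the wasted-space term r_w is left to the caller.

```python
def volume_utilisation(placed, bin_dims):
    """r_v: sum of the volumes of the placed objects over the bin volume."""
    L, W, H = bin_dims
    return sum(l * w * h for (l, w, h) in placed) / (L * W * H)

def step_reward(r_v, r_w, alpha=0.5):
    """R(s_t, a, s_{t+1}) = r_v - alpha * r_w; r_w is the wasted-space ratio."""
    return r_v - alpha * r_w

r = step_reward(volume_utilisation([(3, 2, 4), (2, 2, 2)], (10, 10, 10)), r_w=0.05)
```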
  • to verify the packing performance of the proposed algorithm in real scenes, this embodiment builds a real robot packing system to deploy the policy learned in simulation.
  • Figure 5 shows the robot packing system; the system consists of three work areas: a conveyor belt for transporting objects, product boxes for storing objects, and a buffer for temporarily storing objects.
  • a UR5 robotic arm equipped with a suction-cup gripper performs the pick-and-place tasks; two Intel RealSense D435i cameras are mounted above the conveyor belt and the product box respectively to capture RGB-D images of these two areas at a resolution of 1280×720.
  • to train and test the proposed network, this embodiment creates three packing datasets, DATA-1, DATA-2 and DATA-3.
  • to generate DATA-1, a box of size S_d is cut along the length, width and height directions into objects of different sizes, where d ∈ {x, y, z}; the size s_d of each object does not exceed S_d/2, and the sequences of cut objects are randomly shuffled to increase data diversity.
  • to generate DATA-2 and DATA-3, 64 object sizes are predefined first, and the box of size S_d is then cut into objects of these predefined sizes; the sequences of cut objects are afterwards sorted according to set rules. Statistically, DATA-2 and DATA-3 contain fewer objects than DATA-1 and the corresponding objects have a larger average volume, which affects the packing performance of the algorithms on these three datasets.
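  • The following sketch illustrates the DATA-1 style of dataset generation by recursively cutting a box; the cutting axis choice and stopping probability are illustrative assumptions, and only the stated constraint (each object edge at most half the bin edge) is taken from the text above.

```python
import random

def cut_box(dims, seed=0):
    """Cut a box of size S_d into smaller cuboids and return a shuffled
    object sequence, mirroring the DATA-1 construction described above."""
    rng = random.Random(seed)
    S = dims
    objects, stack = [], [tuple(dims)]
    while stack:
        box = stack.pop()
        small_enough = all(box[i] <= S[i] // 2 for i in range(3))
        if small_enough and rng.random() < 0.5:
            objects.append(box)              # finished object: every edge <= S_d / 2
            continue
        axis = max(range(3), key=lambda i: box[i])
        if box[axis] < 2:                    # cannot be cut any further
            objects.append(box)
            continue
        cut = rng.randint(1, box[axis] - 1)
        for part in (cut, box[axis] - cut):
            child = list(box)
            child[axis] = part
            stack.append(tuple(child))
    rng.shuffle(objects)                     # random order increases data diversity
    return objects

sequence = cut_box((10, 10, 10))             # one shuffled object sequence
```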
  • the proposed algorithm is compared with 10 baseline algorithms, which fall into two groups: the first group includes 1 random packing strategy and 6 heuristic packing algorithms, and the second group includes 3 state-of-the-art DRL-based packing algorithms.
  • for the DRL-based algorithms, when there is not enough space in the product box to place the current object, the agent stops packing in the current episode and starts a new one; since the existing methods have different variants, this embodiment selects the best-performing variant of each as the baseline.
  • this embodiment uses three metrics to evaluate packing performance: space utilization, defined as the ratio of the volume of all objects in the product box to the total volume of the product box; packing quantity, defined as the average number of objects contained in a product box; and standard deviation, defined as the standard deviation of the space utilization metric.
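  • These three metrics can be computed as in the sketch below; the function names and episode data structure are illustrative assumptions.

```python
import statistics

def space_utilisation(placed, bin_dims):
    """Ratio of the volume of all packed objects to the bin volume."""
    L, W, H = bin_dims
    return sum(l * w * h for l, w, h in placed) / (L * W * H)

def evaluate(episodes, bin_dims):
    """episodes: one list of packed-object dimensions per product box.
    Returns mean space utilisation, mean packing quantity, and the standard
    deviation of the space utilisation metric."""
    utils = [space_utilisation(ep, bin_dims) for ep in episodes]
    counts = [len(ep) for ep in episodes]
    return {
        "space_utilisation": statistics.mean(utils),
        "packing_quantity": statistics.mean(counts),
        "std": statistics.pstdev(utils),
    }
```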
  • the experimental results in Table 1 show that the algorithm of this embodiment outperforms the baseline algorithms on all evaluation metrics, which indicates that, compared with the other algorithms, it makes fuller use of the space in the product box and is more robust; moreover, because the three datasets differ in object size and number, all packing algorithms perform better on DATA-1 than on DATA-2 and DATA-3.
  • this embodiment also visualizes the packing results of PUN and two baseline algorithms to compare their packing performance qualitatively: 6 object sequences were randomly selected from DATA-1, and all algorithms used the same object sequence in each test; the visualization results are shown in Figure 6, where the value next to each sub-figure is the space utilization of the corresponding algorithm, and the results show that the algorithm of this embodiment achieves better packing performance and higher space utilization.
  • to prove the effectiveness of the design choices of PUN, this embodiment creates 3 variants and conducts a series of ablation studies.
  • the first is the single packing-network variant, for which only the packing branch of PUN is kept and retrained; the second is the random unpacking variant, which adds unpacking on top of the packing branch but replaces the unpacking part by a module based on a random rule, namely that once there is not enough space in the product box to place the object, an object already in the box is randomly selected and unpacked; the third is the no-action-constraint variant, a simplified version in which the heuristic action constraint part of PUN is removed while the collaboration between packing and unpacking is still learned with DRL.
  • the learning-based variants (PUN and the no-action-constraint variant) perform better than the variant based on random rules (the random unpacking variant), which proves the effectiveness of learning the collaboration between packing and unpacking with the proposed DRL framework; in addition, the packing performance of PUN is better than that of the no-action-constraint variant, proving that the heuristic-rule-based action constraints provide useful guidance for training PUN and improve packing performance.
  • to explore the influence of different reward settings, this embodiment designs three additional reward functions and compares them on DATA-1 with the reward function used by the algorithm of this embodiment.
  • the first is a constant reward: at each time step, a fixed reward of 1 is given if a packing action is executed successfully, a fixed reward of 0.5 is given if an unpacking action is executed successfully, and the reward is 0 if the action fails.
  • the second is the final-utilization reward: for each packing episode, the space utilization at the last time step is recorded and used as the reward of every time step.
  • the last is the packing-quantity reward: the number of objects packed in the product box at each time step is used as the reward of that time step.
  • the experimental results are shown in Table 2; the reward function used in this embodiment outperforms the other three reward settings in both space utilization and packing quantity, which proves the effectiveness of the proposed reward function.
  • this embodiment experiments on DATA-1 with different upper limits on the number of objects in the buffer to explore the influence of the buffer capacity B; for this experiment, each packing episode ends when the number of objects in the buffer exceeds the capacity limit, or when there is no suitable position in the product box for an unpacking or packing operation.
  • the experimental results are shown in Table 3; a larger buffer capacity B leads to higher packing performance, but once the capacity limit exceeds 12 the performance gain of the algorithm of this embodiment is no longer significant.
  • this embodiment demonstrates the proposed algorithm in a real physical scene, where the robotic arm uses the learned collaboration between packing and unpacking to load objects from the conveyor belt into product boxes.
  • this part includes three test experiments: the first two demonstrate the online packing of two different types of items (express boxes and commodity boxes); in the third, interference is added during the robotic arm's packing process by manually placing a number of items at random, after which the robotic arm continues packing from this scene.
  • Figure 8 shows the test results of the algorithm of this embodiment in these three experiments; different objects are packed into the product boxes with high space utilization, and the complete packing process can be found in the supplementary video, which shows that the algorithm reliably completes the online three-dimensional packing task in different real logistics scenarios.
  • the present invention introduces an unpacking mechanism into the three-dimensional packing problem; on this basis, this embodiment proposes the Packing and Unpacking Network (PUN) to learn the synergy between packing and unpacking actions, proposes a packing heuristic and an unpacking heuristic based on human experience and incorporates them into PUN to further improve packing performance, and designs a real-world robot packing system to deploy the learned policy; the results of both simulated and real experiments prove the effectiveness of the method described in this embodiment.
  • Embodiment 2 of the present invention provides a deep reinforcement learning-based packing and unpacking collaboration system for a robotic arm, including:
  • a data acquisition module configured to obtain the product box height map and the object size map;
  • a feature extraction module configured to extract features from the product box height map and the object size map;
  • a probability distribution generation module configured to obtain, from the extracted features, the probability distribution of performing the packing action at every position of the height map via the packing actor network, and the probability distribution of performing the unpacking action at every position of the height map via the unpacking actor network;
  • a state value generation module configured to obtain, from the extracted features, the packing predicted state value via the packing critic network and the unpacking predicted state value via the unpacking critic network;
  • a mask generation module configured to obtain, from the product box height map and the object size map, the packing mask via the packing mask predictor and the unpacking mask via the unpacking mask predictor, both based on heuristic rules;
  • an action collaboration module configured to modulate the probability distribution of the packing action with the packing mask to obtain the packing position, modulate the probability distribution of the unpacking action with the unpacking mask to obtain the unpacking position, and take the action type with the larger of the packing and unpacking predicted state values as the final action type to perform the packing or unpacking action.
  • the working method of the system is the same as the deep reinforcement learning-based robotic arm packing and unpacking collaboration method provided in Embodiment 1 and is not described again here.
  • Embodiment 3 of the present invention provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the steps of the deep reinforcement learning-based robotic arm packing and unpacking collaboration method described in Embodiment 1 of the present invention are implemented.
  • Embodiment 4 of the present invention provides an electronic device, including a memory, a processor, and a program stored in the memory and executable on the processor; when the processor executes the program, the steps of the method described in Embodiment 1 of the present invention are implemented.
  • embodiments of the present invention may be provided as methods, systems, or computer program products; accordingly, the invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects; furthermore, the invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, etc.) containing computer-usable program code.
  • these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • these computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • those of ordinary skill in the art can understand that all or part of the processes in the above embodiment methods can be implemented by instructing the relevant hardware through a computer program; the program can be stored in a computer-readable storage medium, and when executed it may include the processes of the embodiments of the above methods; the storage medium can be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)

Abstract

A deep reinforcement learning-based packing and unpacking collaboration method for a robotic arm learns the synergy between packing actions and unpacking actions through a Packing and Unpacking Network (PUN), which improves packing accuracy and efficiency; a packing heuristic and an unpacking heuristic are proposed based on human experience and incorporated into PUN, further improving packing accuracy and efficiency. A deep reinforcement learning-based robotic arm packing and unpacking collaboration system, a computer-readable storage medium and an electronic device are also provided.

Description

Deep reinforcement learning-based robotic arm packing and unpacking collaboration method and system
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on August 9, 2022, with application number 202210949670.2 and entitled "Deep reinforcement learning-based robotic arm packing and unpacking collaboration method and system", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of logistics automation, and in particular to a deep reinforcement learning-based packing and unpacking collaboration method and system for a robotic arm.
Background
The statements in this section merely provide background information related to the present invention and do not necessarily constitute prior art.
With the rapid development of e-commerce and the express delivery industry, building intelligent warehousing and logistics systems has become an emerging research hotspot. In a typical logistics packing scenario, a packing worker picks items one by one from a conveyor belt and packs them into different product boxes to fulfil product orders, while making full use of the space in each product box and minimizing the wasted volume. To solve this task, most researchers formulate it as the online three-dimensional bin packing problem (3D-BPP) and have proposed many packing algorithms. The problem aims to pack a set of items into as few three-dimensional boxes as possible while only the item currently to be packed can be observed. As a classical NP-hard combinatorial optimization problem, finding the optimal packing policy is very difficult when the information of only one object can be obtained at a time.
Early solutions were mainly heuristic packing algorithms, which essentially model the packing experience of human workers as rules and use them as the packing policy. Although this is an intuitive way to solve the packing task and usually maintains good working efficiency, the modeled empirical rules are not universal, and without a learning process a heuristic is unlikely to find the best packing policy in every scenario. Inspired by recent progress in deep reinforcement learning (DRL), researchers have tried to solve the packing problem with DRL: they formulate it as a Markov decision process and learn the optimal packing policy through continuous trial-and-error interaction with the environment to obtain rewards. Compared with traditional heuristics, DRL-based packing algorithms can learn more optimized packing policies and are more robust and generalizable.
Although existing packing algorithms have made good progress, they all focus only on how to place the currently observed item better and ignore a crucial piece of human experience: unpacking. Unpacking means that, when there is no suitable position in the product box for the current item, some inappropriately placed objects are removed from the box to free up more space for the current item. When only one object to be packed can be observed at a time, this unpacking behaviour is very important for the packing problem.
Existing algorithms for solving 3D-BPP fall into two main categories: heuristic algorithms and deep reinforcement learning-based algorithms.
Heuristic algorithms focus on packing a set of cuboid objects into a limited three-dimensional space (such as a shipping box) with the highest possible space utilization, which is a typical NP-hard combinatorial optimization problem. The online 3D-BPP assumes that only the information of the object currently to be packed is available and other object information is unknown; early work mainly designed different heuristics, such as tabu search, the first-fit algorithm and extreme-point placement. A heuristic is essentially a distillation of the real packing experience of human workers and is not perfect for every packing scenario.
DRL-based algorithms have been shown to solve combinatorial optimization problems effectively; therefore, some researchers have started to use DRL to solve 3D-BPP. For example, Verma et al. developed a DRL algorithm for the three-dimensional packing problem with boxes of different sizes and quantities; Zhao et al. formulated 3D-BPP as a constrained Markov decision process and proposed a constrained DRL framework that demonstrates advanced packing performance in simulation; Yang et al. also tried to combine heuristics with DRL to learn more optimized packing policies.
However, compared with real human workers, existing algorithms still have many shortcomings: if no suitable position can be found in the product box for the current object, a human worker can unpack and take some objects out of the box to free up more positions for the current object, which existing artificial intelligence algorithms cannot do.
Summary of the Invention
To overcome the deficiencies of the prior art, the present invention provides a deep reinforcement learning-based packing and unpacking collaboration method and system for a robotic arm, which learns the synergy between packing and unpacking actions through a Packing and Unpacking Network (PUN) and thereby improves packing accuracy and efficiency; a packing heuristic and an unpacking heuristic are proposed based on human experience and incorporated into PUN, further improving packing accuracy and efficiency.
To achieve the above purpose, the present invention adopts the following technical solutions:
The first aspect of the present invention provides a deep reinforcement learning-based packing and unpacking collaboration method for a robotic arm.
A deep reinforcement learning-based packing and unpacking collaboration method for a robotic arm includes the following steps:
obtaining the product box height map and the object size map;
extracting features from the product box height map and the object size map;
according to the extracted features, obtaining the probability distribution of performing the packing action at every position of the height map from the packing actor network, and the probability distribution of performing the unpacking action at every position of the height map from the unpacking actor network;
according to the extracted features, obtaining the packing predicted state value from the packing critic network and the unpacking predicted state value from the unpacking critic network;
according to the product box height map and the object size map, obtaining the packing mask from the packing mask predictor based on heuristic rules, and the unpacking mask from the unpacking mask predictor based on heuristic rules;
modulating the probability distribution of the packing action with the packing mask to obtain the packing position, modulating the probability distribution of the unpacking action with the unpacking mask to obtain the unpacking position, and taking the action type with the larger of the packing and unpacking predicted state values as the final action type to perform the packing or unpacking action.
As an optional implementation, for the product box, its state observation at time t is discretized into a height map H_t ∈ Z_+^{L×W} with cell values in [0, H]; the height map is an L×W discrete grid, and the value of each cell represents the cumulative height of the objects placed at that position.
As an optional implementation, an object n to be packed, of size l_n × w_n × h_n, is represented by a three-channel L×W×3 size map whose three channels are assigned the values l_n, w_n and h_n respectively.
As an optional implementation, in the packing action and the unpacking action, each action is parameterized as a basic action executed at a position p, where p corresponds to a cell in the height-map representation of the product box.
Further, the packing action includes: placing the object on the cell, with the left-front corner of the object's placement position coinciding with the coordinates of the cell; a 2D coordinate (x_n, y_n) is used to represent the position of the cell, and the action (packing, (x_n, y_n)) means placing object n at position (x_n, y_n) of the height map.
Further, the unpacking action includes: p indicates removing the object from the cell, where the object lies in the product box and the coordinates of its left-front corner are the same as the cell coordinates; a temporary object buffer is set up to store at most B unpacked objects, and the action (unpacking, (x_m, y_m)) means removing the object at height-map position (x_m, y_m) and placing it in the temporary object buffer.
As an optional implementation, for the object to be packed at time step t, the predicted basic action and position are obtained; the object to be packed comes from the conveyor belt or from the object buffer;
if a packing action is predicted, the object to be packed is grasped from the conveyor belt or the object buffer and placed at the target position p in the product box, and the next object to be packed is selected at time step t+1;
if an unpacking action is predicted, object m is picked up from the target position p in the product box and placed in the object buffer for temporary storage; at the next time step t+1, object n remains the current object to be packed and packing or unpacking actions are predicted for it until it is placed in the product box;
after object n has been placed, object m in the buffer is taken as the object to be packed when predicting the action at the next time step; if there are several objects in the buffer, the object to be packed is selected in the order in which the objects were unpacked into the buffer, and only after all objects in the buffer have been cleared can the next object to be packed be selected from the conveyor belt.
As an optional implementation, for the packing actor network and the unpacking actor network, the action constraints are treated as a re-normalized probability distribution π_p(a_t, s_t):

    π_p(a_t, s_t) = exp(l_i) / Σ_{j=1}^{N} exp(l_j),

where N is the number of valid actions and the sum runs over the valid actions only; s_t is the environment state at time t, a_t is the action at time t, and l_i is the unnormalized score output by the packing actor network or the unpacking actor network.
As an optional implementation, the packing task reward function is R_r(s_t, a, s_{t+1}) = r_v − α × r_w,
where r_v is the volume utilization rate and r_w is the space waste rate; the volume utilization rate r_v is the ratio of the sum of the volumes of all objects in the product box to the total volume of the product box, and the space waste rate r_w is the ratio of the wasted space in the product box to the total volume of the product box.
The second aspect of the present invention provides a deep reinforcement learning-based packing and unpacking collaboration system for a robotic arm.
A deep reinforcement learning-based packing and unpacking collaboration system for a robotic arm includes:
a data acquisition module configured to obtain the product box height map and the object size map;
a feature extraction module configured to extract features from the product box height map and the object size map;
a probability distribution generation module configured to obtain, from the extracted features, the probability distribution of performing the packing action at every position of the height map via the packing actor network and the probability distribution of performing the unpacking action at every position of the height map via the unpacking actor network;
a state value generation module configured to obtain, from the extracted features, the packing predicted state value via the packing critic network and the unpacking predicted state value via the unpacking critic network;
a mask generation module configured to obtain, from the product box height map and the object size map, the packing mask via the packing mask predictor and the unpacking mask via the unpacking mask predictor, both based on heuristic rules;
an action collaboration module configured to modulate the probability distribution of the packing action with the packing mask to obtain the packing position, modulate the probability distribution of the unpacking action with the unpacking mask to obtain the unpacking position, and take the action type with the larger of the packing and unpacking predicted state values as the final action type to perform the packing or unpacking action.
The third aspect of the present invention provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the steps of the deep reinforcement learning-based robotic arm packing and unpacking collaboration method described in the first aspect of the present invention are implemented.
The fourth aspect of the present invention provides an electronic device, including a memory, a processor, and a program stored in the memory and executable on the processor; when the processor executes the program, the steps of the deep reinforcement learning-based robotic arm packing and unpacking collaboration method described in the first aspect of the present invention are implemented.
Compared with the prior art, the beneficial effects of the present invention are:
1. The deep reinforcement learning-based robotic arm packing and unpacking collaboration method and system of the present invention introduce an unpacking mechanism into the packing problem; by removing inappropriately placed objects and re-placing them in the product box, the final space utilization of the product box is greatly improved.
2. The method and system propose a deep reinforcement learning-based Packing and Unpacking Network (PUN); PUN is a two-branch architecture that produces the state values corresponding to the packing and unpacking actions and the execution position of each action, and decides whether to perform the packing or unpacking action according to the state values; the whole collaboration between packing and unpacking actions is learned within one DRL framework, improving the accuracy and efficiency of collaborative control.
3. The method and system design a packing heuristic and an unpacking heuristic and incorporate them into the DRL framework to guide network learning, further improving the accuracy and efficiency of collaborative control.
Advantages of additional aspects of the present invention will be partly given in the following description, will partly become apparent from the following description, or will be learned through practice of the present invention.
Brief Description of the Drawings
The accompanying drawings, which form a part of the present invention, are used to provide a further understanding of the present invention; the illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention.
Figure 1 is a schematic diagram of the collaboration between packing and unpacking actions provided in Embodiment 1 of the present invention.
Figure 2 is a schematic structural diagram of the PUN network provided in Embodiment 1 of the present invention.
Figure 3 is a schematic diagram of an example of the packing and unpacking collaboration process provided in Embodiment 1 of the present invention.
Figure 4 is a schematic diagram of an example of wasted space provided in Embodiment 1 of the present invention.
Figure 5 is a schematic diagram of the robot packing system provided in Embodiment 1 of the present invention.
Figure 6 is a visualization of the results of different packing algorithms provided in Embodiment 1 of the present invention.
Figure 7 shows the learning curves of the PUN variants provided in Embodiment 1 of the present invention.
Figure 8 shows the results of the real-world packing demonstration provided in Embodiment 1 of the present invention.
Detailed Description
The present invention will be further described below in conjunction with the accompanying drawings and embodiments.
It should be pointed out that the following detailed descriptions are exemplary and intended to provide further explanation of the present invention. Unless otherwise specified, all technical and scientific terms used herein have the same meanings as commonly understood by those of ordinary skill in the technical field to which the present invention belongs.
It should be noted that the terms used here are only for describing specific implementations and are not intended to limit the exemplary implementations of the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms are also intended to include the plural forms; in addition, it should also be understood that when the terms "comprise" and/or "include" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
In the case of no conflict, the embodiments of the present invention and the features in the embodiments can be combined with each other.
Embodiment 1:
Embodiment 1 of the present invention provides a deep reinforcement learning-based packing and unpacking collaboration method for a robotic arm, including the following processes:
As shown in Figure 1, an unpacking mechanism is introduced into the packing problem: inappropriately placed objects are removed and re-placed in the product box to improve the final space utilization. To learn the collaboration between packing and unpacking, this embodiment proposes a deep reinforcement learning-based packing and unpacking collaboration network, named PUN. Specifically, PUN is a two-branch architecture that produces the state values corresponding to the packing and unpacking actions and the execution position of each action; PUN decides whether to perform the packing or unpacking action according to the state values, and the whole collaboration between packing and unpacking actions is learned within one DRL framework. Meanwhile, this embodiment also designs a packing heuristic and an unpacking heuristic and incorporates them into the DRL framework to guide network learning.
In the online three-dimensional packing task, items are delivered to the target area one by one by a conveyor belt and then packed into product boxes. Only the information of the object currently to be packed is observable. The packing task is formulated as a Markov decision process (MDP) with state S, action A, transition probability function P and reward R; an end-to-end DRL framework is used to solve this MDP, which seeks a policy π(a_t | s_t; θ_π) that maximizes the expected sum of cumulative rewards, E[Σ_t R(s_t, a_t, s_{t+1})].
S1: Packing and unpacking collaboration network
The network structure of PUN is shown in Figure 2, and the details are described as follows:
State representation: the state observations of the product box and of the object to be packed are modeled to represent the state s of the environment. For the product box, its state observation at time t is discretized into a height map H_t ∈ Z_+^{L×W} with cell values in [0, H], which serves as the state representation of the product box; the height map is an L×W discrete grid, and the value of each cell represents the cumulative height of the objects placed at that position. An object n to be packed, of size l_n × w_n × h_n, is represented by a three-channel L×W×3 size map whose three channels are assigned the values l_n, w_n and h_n respectively.
Action definition: two types of actions are defined, the packing action and the unpacking action. Each action a_t is parameterized as a basic action (packing or unpacking) executed at a position p, where p corresponds to a cell c in the height-map representation of the product box:

    a_t = (ψ, p), ψ ∈ {packing, unpacking}, p → c ∈ H_t.

The left-front-bottom corner of an object is used as its position coordinate, and executing a packing or unpacking action on an object requires the object's left-front corner coordinates to coincide with the given coordinates. The basic action ψ is defined as follows:
(1) Packing: place the object on cell c, with the left-front corner of the object's placement position coinciding with the coordinates of cell c; this embodiment uses a 2D coordinate (x_n, y_n) to represent the position of cell c, and the action (packing, (x_n, y_n)) means placing object n at position (x_n, y_n) of the height map.
(2) Unpacking: remove the object from cell c, where the object lies in the product box and the coordinates of its left-front corner are the same as those of cell c. This embodiment sets up a temporary object buffer that stores at most B unpacked objects; the action (unpacking, (x_m, y_m)) means removing the object at height-map position (x_m, y_m) and placing it in the temporary object buffer.
Collaboration between packing and unpacking actions: as shown in Figure 2, PUN is a two-branch network structure. It takes the product box height map and the object size map as input and outputs the type of the basic action ψ and the position p at which it is executed on the height map. The state input is encoded into features by a shared feature extraction module and then fed to the packing network and the unpacking network respectively.
The packing branch and the unpacking branch each consist of three parts: an actor network, a critic network and a heuristic mask predictor; the actor and critic networks of the two branches form the packing main network and the unpacking main network respectively.
Each actor network outputs the probability distribution of executing its basic action ψ at every position of the height map; the two critic networks predict the state values V_p(s_t) and V_u(s_t), which measure the cumulative reward obtained by taking the two basic actions in state s_t. The heuristic mask predictors generate the packing mask M_p and the unpacking mask M_u based on heuristic rules, predicting the valid packing and unpacking positions respectively, and the probability distributions output by the actor networks are modulated by the heuristic masks.
For the current state s_t, PUN outputs two state values (V_p(s_t), V_u(s_t)) corresponding to the packing and unpacking actions and two probability distributions (π_p(a_t, s_t), π_u(a_t, s_t)); the final action a_t consists of the basic action ψ and the execution position p. The basic action with the higher state value is chosen as the final basic action ψ, meaning that the DRL agent is expected to receive the higher reward when executing it; the final execution position p is sampled from the probability distribution corresponding to the chosen basic action.
The complete packing and unpacking collaboration process is as follows: for the object n to be packed at time step t, PUN predicts the basic action ψ and the position p to execute; object n can come from the conveyor belt or from the object buffer. If a packing action is predicted, the agent grasps object n from the conveyor belt (or the buffer) and places it at the target position p in the product box, then moves to time step t+1 to select the next object to be packed. If an unpacking action is predicted, the agent picks up object m from the target position p in the product box and places it in the buffer for temporary storage; at the next time step t+1, the agent keeps object n as the current object to be packed and continues to predict packing or unpacking actions for it until it is placed in the product box. After object n has been placed, object m in the buffer is taken as the object to be packed when predicting the action at the next time step. If the buffer contains several objects, the agent selects the object to be packed in the order in which the objects were unpacked into the buffer (first in, first out). Only after all objects in the buffer have been cleared can the agent select the next object to be packed from the conveyor belt. Figure 3 shows an example of the packing and unpacking collaboration process; the objects to be packed in Figure 3 are the objects waiting to be packed at the current time step, and the numbers on the objects indicate their packing order.
Heuristic-based action constraints: three types of heuristic rules are designed and introduced into the proposed network framework to guide the learning of the network.
The first is the physics heuristic, which defines actions that would make packing or unpacking fail as invalid, for example a packing action that places an object beyond the boundaries of the product box, or an unpacking action executed at a position where no object is placed. The second is the packing heuristic, which consists of four currently popular heuristic packing algorithms: extreme-point placement, remaining maximum space, first fit and layered placement. The last is the unpacking heuristic: since few heuristic algorithms exist for unpacking, this embodiment designs two unpacking rules, and for objects already placed in the product box, the positions of objects that lie on the top layer and whose volume is below a given threshold are defined as valid unpacking positions.
To incorporate these three heuristics into the DRL framework, two mask predictors are designed to generate binary feasibility masks M that indicate valid and invalid actions; only actions that satisfy the heuristic rules are valid. This embodiment then uses the feasibility masks to modulate the outputs of the packing and unpacking branches. In the DRL framework, the actor network outputs unnormalized scores l (logits), which are turned into an action probability distribution by a softmax layer; for an invalid action, its logit is replaced by a very large negative value (-1×10^8), so that the probability of an invalid action output by the softmax is almost zero. The action constraints are treated as the re-normalized probability distribution π_p(a_t, s_t):

    π_p(a_t, s_t) = exp(l_i) / Σ_{j=1}^{N} exp(l_j),

where N is the number of valid actions and the sum runs over the valid actions only.
Reward function: a step-wise reward is designed for the packing task, consisting of two parts, the volume utilization rate r_v and the space waste rate r_w. The volume utilization rate r_v is defined as the ratio of the sum of the volumes of all objects in the product box to the total volume of the product box:

    r_v = (Σ_i l_i × w_i × h_i) / (L × W × H),

where the sum runs over the objects already placed in the box. The space waste rate r_w is defined as the ratio of the wasted space in the product box to the total volume of the product box, and measures the amount of space in which no object can be placed in the current state.
As shown in Figure 4, the two dashed cuboids marked 1 and 2 are the wasted space in the current state. The reward setting of this embodiment encourages unpacking actions that reduce wasted space. The complete reward function is defined as:

    R_r(s_t, a, s_{t+1}) = r_v − α × r_w,

where α is the weight parameter of the space waste rate r_w.
S2: Deployment in real logistics scenarios
To verify the packing performance of the proposed algorithm in real scenarios, this embodiment builds a real robot packing system to deploy the policy learned in simulation.
Figure 5 shows the robot packing system. The system includes three work areas: a conveyor belt for transporting objects, product boxes for storing objects, and a buffer for temporarily storing objects. A UR5 robotic arm equipped with a suction-cup gripper performs the pick-and-place tasks. Two Intel RealSense D435i cameras are mounted above the conveyor belt and the product box respectively to capture RGB-D images of these two areas at a pixel resolution of 1280×720.
Since the packing policy is trained in a simulation environment, a transfer strategy is needed to apply it to the real environment; the key is to unify the inputs and outputs of the simulated and real environments. PUN has two state inputs: the product box height map and the object size map. To obtain the product box height map, this embodiment first filters the image captured by the camera and then maps it onto a discrete orthogonal grid using bilinear interpolation; the resulting grid has the same resolution (L×W) as the required height map. To obtain the object size map, this embodiment uses a background subtraction algorithm to detect the boundaries of the object on the conveyor belt and computes the object's three-dimensional size; to simplify the computation and ensure the safety of the packing process, the three-dimensional size of the object is rounded up to discrete values and finally assigned to an L×W×3 three-channel map.
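A minimal sketch of this sim-to-real input construction is given below, using OpenCV for filtering and resampling; the filter choice, thresholds, units and grid size are illustrative assumptions, since the patent only specifies filtering, bilinear interpolation, background subtraction and rounding up.

```python
import numpy as np
import cv2  # OpenCV, assumed available for filtering and resampling

def depth_to_height_map(depth_mm, box_bottom_mm, grid=(10, 10)):
    """Convert the depth image over the product box into an L x W height map:
    filter the raw depth, convert depth to height above the box bottom, and
    resample onto the discrete grid with bilinear interpolation."""
    depth = cv2.medianBlur(depth_mm.astype(np.float32), 5)
    height = np.clip(box_bottom_mm - depth, 0, None)        # mm above the box bottom
    return cv2.resize(height, grid[::-1], interpolation=cv2.INTER_LINEAR)

def object_dims_from_foreground(depth_mm, background_mm, mm_per_cell):
    """Estimate the object's discrete size on the conveyor by background
    subtraction, rounding up so that planned placements stay conservative."""
    fg = (background_mm - depth_mm) > 5                     # 5 mm foreground threshold
    ys, xs = np.nonzero(fg)
    h = (background_mm - depth_mm)[fg].max()
    l = xs.max() - xs.min() + 1
    w = ys.max() - ys.min() + 1
    return tuple(int(np.ceil(v / mm_per_cell)) for v in (l, w, h))
```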
S3: Experimental results
S3.1: Dataset generation
To train and test the proposed network, this embodiment creates three packing datasets, DATA-1, DATA-2 and DATA-3. To generate DATA-1, a box of size S_d is cut along the length, width and height directions into objects of different sizes, where d ∈ {x, y, z}; the size s_d of each object does not exceed S_d/2, and the cut object sequences are randomly shuffled to increase data diversity. To generate DATA-2 and DATA-3, 64 object sizes are predefined first, and the box of size S_d is then cut into objects of these predefined sizes; the cut object sequences are afterwards sorted according to set rules. Statistically, DATA-2 and DATA-3 contain fewer objects than DATA-1 and the corresponding objects have a larger average volume, which affects the packing performance of the algorithms on these three datasets.
S3.2: Packing performance comparison
The proposed algorithm is compared with 10 baseline algorithms, which fall into two groups: the first group includes 1 random packing strategy and 6 heuristic packing algorithms, and the second group includes 3 state-of-the-art DRL-based packing algorithms. For the DRL-based algorithms, when there is not enough space in the product box to place the current object, the agent stops packing in the current episode and starts a new one; since the existing methods have different variants, this embodiment selects the best-performing variant of each as the baseline.
This embodiment uses three metrics to evaluate packing performance. Space utilization is defined as the ratio of the volume of all objects in the product box to the total volume of the product box. Packing quantity is defined as the average number of objects contained in a product box. Standard deviation is defined as the standard deviation of the space utilization metric. The experimental results are shown in Table 1; they show that the algorithm of this embodiment outperforms the baseline algorithms on all evaluation metrics, which indicates that, compared with the other algorithms, it makes fuller use of the space in the product box and is more robust. In addition, because the three datasets differ in object size and number, the packing algorithms perform better on DATA-1 than on DATA-2 and DATA-3.
Table 1: Packing performance comparison of different algorithms (the table contents are provided as images in the original PCT publication).
This embodiment also visualizes the packing results of PUN and two baseline algorithms to compare their packing performance qualitatively. In this qualitative experiment, 6 object sequences were randomly selected from DATA-1, and all algorithms used the same object sequence in each test. The visualization results are shown in Figure 6, where the value next to each sub-figure is the space utilization of the corresponding algorithm. The results show that the algorithm of this embodiment achieves better packing performance and higher space utilization.
S3.3: Ablation experiments
To prove the effectiveness of the design choices of PUN, this embodiment creates 3 variants and conducts a series of ablation studies. The first is the single packing-network variant, for which only the packing branch of PUN is kept and retrained. The second is the random unpacking variant, which adds unpacking on top of the packing branch but implements it as a module based on a random rule: when there is not enough space in the product box to place the object, an object already in the box is randomly selected and unpacked. The third is the no-action-constraint variant, a simplified version in which the heuristic action constraint part of PUN is removed while the collaboration between packing and unpacking is still learned with DRL.
This embodiment performs the ablation experiments on DATA-1 and plots the learning curves, with the results shown in Figure 7. First, PUN outperforms the other three variants on both packing evaluation metrics, space utilization and packing quantity, which proves the effectiveness of the proposed network design. Second, compared with the variant without unpacking (the single packing-network variant), the variants with unpacking (PUN, the no-action-constraint variant and the random unpacking variant) achieve higher space utilization and pack more objects; this shows that adding the unpacking action effectively improves packing performance and thus proves the effectiveness of introducing an unpacking mechanism into the three-dimensional packing task. Third, among the three variants with unpacking, the learning-based variants (PUN and the no-action-constraint variant) perform better than the variant based on random rules (the random unpacking variant), which proves the effectiveness of learning the collaboration between packing and unpacking with the proposed DRL framework. Finally, the packing performance of PUN is better than that of the no-action-constraint variant, proving that the heuristic-rule-based action constraints provide useful guidance for training PUN and improve packing performance.
To explore the influence of different reward settings on packing performance, this embodiment designs three additional reward functions and compares them on DATA-1 with the reward function used by the algorithm of this embodiment. The first is a constant reward: at each packing time step, a fixed reward of 1 is given if a packing action is executed successfully, a fixed reward of 0.5 is given if an unpacking action is executed successfully, and the reward is 0 if the action fails. The second is the final-utilization reward: for each packing episode, the space utilization at the last time step is recorded and used as the reward of every time step. The last is the packing-quantity reward: the number of objects packed in the product box at each time step is used as the reward of that time step. The experimental results are shown in Table 2; the reward function used in this embodiment outperforms the other three reward settings in both space utilization and packing quantity, which proves the effectiveness of the proposed reward function.
Table 2: Comparison of different reward functions (the table contents are provided as an image in the original PCT publication).
This embodiment experiments on DATA-1 with different upper limits on the number of objects in the buffer to explore the influence of the buffer capacity B. For this experiment, each packing episode ends when the number of objects in the buffer exceeds the capacity limit, or when there is no suitable position in the product box for an unpacking or packing operation. The experimental results are shown in Table 3; a larger buffer capacity B leads to higher packing performance, but once the capacity limit exceeds 12 the performance gain of the algorithm of this embodiment is no longer significant.
Table 3: Comparison of different buffer capacity limits (the table contents are provided as an image in the original PCT publication).
S3.4: Real-world packing demonstration
This embodiment demonstrates the proposed algorithm in a real physical scene: the robotic arm uses the learned collaboration between packing and unpacking to load objects from the conveyor belt into product boxes. This part includes three test experiments. The first two demonstrate the online packing of two different types of items (express boxes and commodity boxes). In the third experiment, interference is added in the middle of the robotic arm's packing process by manually placing a number of items at random, after which the robotic arm continues packing from this scene. Figure 8 shows the test results of the algorithm of this embodiment in these three experiments; different objects are packed into the product boxes with high space utilization. The complete packing process can be found in the supplementary video, which shows that the algorithm of this embodiment reliably completes the online three-dimensional packing task in different real logistics scenarios.
The present invention introduces an unpacking mechanism into the three-dimensional packing problem; on this basis, this embodiment proposes the Packing and Unpacking Network (PUN) to learn the synergy between packing and unpacking actions. This embodiment also proposes a packing heuristic and an unpacking heuristic based on human experience and incorporates them into PUN to further improve packing performance. On this basis, this embodiment designs a real-world robot packing system to deploy the learned policy, and the results of both simulated and real experiments prove the effectiveness of the method described in this embodiment.
Embodiment 2:
Embodiment 2 of the present invention provides a deep reinforcement learning-based packing and unpacking collaboration system for a robotic arm, including:
a data acquisition module configured to obtain the product box height map and the object size map;
a feature extraction module configured to extract features from the product box height map and the object size map;
a probability distribution generation module configured to obtain, from the extracted features, the probability distribution of performing the packing action at every position of the height map via the packing actor network and the probability distribution of performing the unpacking action at every position of the height map via the unpacking actor network;
a state value generation module configured to obtain, from the extracted features, the packing predicted state value via the packing critic network and the unpacking predicted state value via the unpacking critic network;
a mask generation module configured to obtain, from the product box height map and the object size map, the packing mask via the packing mask predictor and the unpacking mask via the unpacking mask predictor, both based on heuristic rules;
an action collaboration module configured to modulate the probability distribution of the packing action with the packing mask to obtain the packing position, modulate the probability distribution of the unpacking action with the unpacking mask to obtain the unpacking position, and take the action type with the larger of the packing and unpacking predicted state values as the final action type to perform the packing or unpacking action.
The working method of the system is the same as the deep reinforcement learning-based robotic arm packing and unpacking collaboration method provided in Embodiment 1 and is not described again here.
Embodiment 3:
Embodiment 3 of the present invention provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, the steps of the deep reinforcement learning-based robotic arm packing and unpacking collaboration method described in Embodiment 1 of the present invention are implemented.
Embodiment 4:
Embodiment 4 of the present invention provides an electronic device, including a memory, a processor, and a program stored in the memory and executable on the processor; when the processor executes the program, the steps of the deep reinforcement learning-based robotic arm packing and unpacking collaboration method described in Embodiment 1 of the present invention are implemented.
Those skilled in the art should understand that the embodiments of the present invention may be provided as methods, systems, or computer program products. Therefore, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, optical storage, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Those of ordinary skill in the art can understand that all or part of the processes in the above embodiment methods can be implemented by instructing the relevant hardware through a computer program; the program can be stored in a computer-readable storage medium, and when executed it may include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The above descriptions are only preferred embodiments of the present invention and are not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

  1. A deep reinforcement learning-based packing and unpacking collaboration method for a robotic arm, characterized by comprising the following steps:
    obtaining a product box height map and an object size map;
    extracting features from the product box height map and the object size map;
    according to the extracted features, obtaining the probability distribution of performing the packing action at every position of the height map from a packing actor network, and the probability distribution of performing the unpacking action at every position of the height map from an unpacking actor network;
    according to the extracted features, obtaining a packing predicted state value from a packing critic network and an unpacking predicted state value from an unpacking critic network;
    according to the product box height map and the object size map, obtaining a packing mask from a packing mask predictor based on heuristic rules, and an unpacking mask from an unpacking mask predictor based on heuristic rules;
    modulating the probability distribution of the packing action with the packing mask to obtain the packing position, modulating the probability distribution of the unpacking action with the unpacking mask to obtain the unpacking position, and taking the action type with the larger of the packing and unpacking predicted state values as the final action type to perform the packing or unpacking action.
  2. The deep reinforcement learning-based packing and unpacking collaboration method for a robotic arm according to claim 1, characterized in that:
    for the product box, its state observation at time t is discretized into a height map H_t ∈ Z_+^{L×W} with cell values in [0, H]; the height map is an L×W discrete grid, and the value of each cell represents the cumulative height of the objects placed at that position;
    or,
    an object n to be packed, of size l_n × w_n × h_n, is represented by a three-channel L×W×3 size map whose three channels are assigned the values l_n, w_n and h_n respectively.
  3. The deep reinforcement learning-based packing and unpacking collaboration method for a robotic arm according to claim 1, characterized in that:
    in the packing action and the unpacking action, each action is parameterized as a basic action executed at a position p, where p corresponds to a cell in the height-map representation of the product box.
  4. The deep reinforcement learning-based packing and unpacking collaboration method for a robotic arm according to claim 3, characterized in that:
    the packing action comprises: placing the object on the cell, with the left-front corner of the object's placement position coinciding with the coordinates of the cell; a 2D coordinate (x_n, y_n) is used to represent the position of the cell, and the action (packing, (x_n, y_n)) means placing object n at position (x_n, y_n) of the height map;
    or,
    the unpacking action comprises: p indicates removing the object from the cell, where the object lies in the product box and the coordinates of its left-front corner are the same as the cell coordinates; a temporary object buffer is set up to store at most B unpacked objects, and the action (unpacking, (x_m, y_m)) means removing the object at height-map position (x_m, y_m) and placing it in the temporary object buffer.
  5. The deep reinforcement learning-based packing and unpacking collaboration method for a robotic arm according to claim 1, characterized in that:
    for the object to be packed at time step t, the predicted basic action and position are obtained; the object to be packed comes from the conveyor belt or the object buffer;
    if a packing action is predicted, the object to be packed is grasped from the conveyor belt or the object buffer and placed at the target position p in the product box, and the next object to be packed is selected at time step t+1;
    if an unpacking action is predicted, object m is picked up from the target position p in the product box and placed in the object buffer for temporary storage; at the next time step t+1, object n remains the current object to be packed and packing or unpacking actions are predicted for it until it is placed in the product box;
    after object n has been placed, object m in the buffer is taken as the object to be packed when predicting the action at the next time step; if there are several objects in the buffer, the object to be packed is selected in the order in which the objects were unpacked into the buffer, and only after all objects in the buffer have been cleared can the next object to be packed be selected from the conveyor belt.
  6. The deep reinforcement learning-based packing and unpacking collaboration method for a robotic arm according to claim 1, characterized in that:
    for the packing actor network and the unpacking actor network, the action constraints are treated as a re-normalized probability distribution π_p(a_t, s_t):

        π_p(a_t, s_t) = exp(l_i) / Σ_{j=1}^{N} exp(l_j),

    where N is the number of valid actions and the sum runs over the valid actions only; s_t is the environment state at time t, a_t is the action at time t, and l_i is the unnormalized score output by the packing actor network or the unpacking actor network.
  7. The deep reinforcement learning-based packing and unpacking collaboration method for a robotic arm according to claim 1, characterized in that:
    the packing task reward function is R_r(s_t, a, s_{t+1}) = r_v − α × r_w,
    where α is the weight parameter of the space waste rate r_w, r_v is the volume utilization rate and r_w is the space waste rate; the volume utilization rate r_v is the ratio of the sum of the volumes of all objects in the product box to the total volume of the product box, and the space waste rate r_w is the ratio of the wasted space in the product box to the total volume of the product box.
  8. A deep reinforcement learning-based packing and unpacking collaboration system for a robotic arm, characterized by comprising:
    a data acquisition module configured to obtain a product box height map and an object size map;
    a feature extraction module configured to extract features from the product box height map and the object size map;
    a probability distribution generation module configured to obtain, from the extracted features, the probability distribution of performing the packing action at every position of the height map via a packing actor network and the probability distribution of performing the unpacking action at every position of the height map via an unpacking actor network;
    a state value generation module configured to obtain, from the extracted features, a packing predicted state value via a packing critic network and an unpacking predicted state value via an unpacking critic network;
    a mask generation module configured to obtain, from the product box height map and the object size map, a packing mask via a packing mask predictor and an unpacking mask via an unpacking mask predictor, both based on heuristic rules;
    an action collaboration module configured to modulate the probability distribution of the packing action with the packing mask to obtain the packing position, modulate the probability distribution of the unpacking action with the unpacking mask to obtain the unpacking position, and take the action type with the larger of the packing and unpacking predicted state values as the final action type to perform the packing or unpacking action.
  9. A computer-readable storage medium on which a program is stored, characterized in that, when the program is executed by a processor, the steps of the deep reinforcement learning-based robotic arm packing and unpacking collaboration method according to any one of claims 1-7 are implemented.
  10. An electronic device, comprising a memory, a processor, and a program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, the steps of the deep reinforcement learning-based robotic arm packing and unpacking collaboration method according to any one of claims 1-7 are implemented.
PCT/CN2022/125985 2022-08-09 2022-10-18 基于深度强化学习的机械臂装箱与倒库协同方法及*** WO2024031831A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210949670.2 2022-08-09
CN202210949670.2A CN115284287A (zh) 2022-08-09 2022-08-09 基于深度强化学习的机械臂装箱与倒库协同方法及***

Publications (1)

Publication Number Publication Date
WO2024031831A1 true WO2024031831A1 (zh) 2024-02-15

Family

ID=83828443

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/125985 WO2024031831A1 (zh) 2022-08-09 2022-10-18 基于深度强化学习的机械臂装箱与倒库协同方法及***

Country Status (2)

Country Link
CN (1) CN115284287A (zh)
WO (1) WO2024031831A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200061811A1 (en) * 2018-08-24 2020-02-27 Nvidia Corporation Robotic control system
CN111331607A (zh) * 2020-04-03 2020-06-26 Shandong University (山东大学) Autonomous grasping and palletizing method and system based on a robotic arm
CN113592855A (zh) * 2021-08-19 2021-11-02 Shandong University (山东大学) Autonomous grasping and packing method and system based on heuristic deep reinforcement learning
CN113664825A (zh) * 2021-07-19 2021-11-19 Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院) Reinforcement learning-based robotic arm grasping method and device for stacked scenes
CN114800512A (zh) * 2022-05-10 2022-07-29 Shanghai Jiao Tong University (上海交通大学) Deep reinforcement learning-based robot push-and-pack method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109969674B (zh) * 2019-04-01 2020-05-05 Beijing Geek+ Technology Co., Ltd. (北京极智嘉科技有限公司) Robot control system and method, computing device and storage medium
JP7273692B2 (ja) * 2019-11-01 2023-05-15 Toshiba Corporation (株式会社東芝) Control device, control method and program
CN111168684B (zh) * 2020-01-22 2022-07-26 Shenzhen Research Institute of Northwestern Polytechnical University (西北工业大学深圳研究院) On-orbit assembly sequence planning method for large space structures
CN113524173B (zh) * 2021-06-17 2022-12-27 Beijing Institute of Control Engineering (北京控制工程研究所) End-to-end intelligent grasping method for extraterrestrial exploration samples


Also Published As

Publication number Publication date
CN115284287A (zh) 2022-11-04

Similar Documents

Publication Publication Date Title
CN104408589B (zh) AGV optimal scheduling method based on a hybrid particle swarm optimization algorithm
Hartmann A general framework for scheduling equipment and manpower at container terminals
CN113592855B (zh) Autonomous grasping and packing method and system based on heuristic deep reinforcement learning
CN110363380B (zh) Dynamic cooperative scheduling method for dual yard cranes in a container yard
Osaba et al. AMCPA: A population metaheuristic with adaptive crossover probability and multi-crossover mechanism for solving combinatorial optimization problems
CN112180934A (zh) Control method, system and device for a loading robot, and readable storage medium
Luo et al. A* guiding DQN algorithm for automated guided vehicle pathfinding problem of robotic mobile fulfillment systems
Jin et al. Container stacking optimization based on Deep Reinforcement Learning
WO2024031831A1 (zh) Deep reinforcement learning-based robotic arm packing and unpacking collaboration method and system
Song et al. Towards online 3d bin packing: Learning synergies between packing and unpacking via drl
Hirashima et al. A Q-learning for group-based plan of container transfer scheduling
CN110203708B (zh) Container yard relocation and slot selection optimization method under incomplete retrieval information
Lin et al. A TABU SEARCH ALGORITHM TO MINIMIZE TOTAL WEIGHTED TARDINESS FOR THE JOB SHOP SCHEDULING PROBLEM.
Hirashima et al. A new reinforcement learning for group-based marshaling plan considering desired layout of containers in port terminals
Kong et al. Scheduling of automated guided vehicles for tandem quay cranes in automated container terminals
WO2022142654A1 (zh) Placement position acquisition method, model training method, and related device
Suárez-Hernández et al. Practical resolution methods for mdps in robotics exemplified with disassembly planning
Hayashi et al. Assembly sequence optimization of spatial trusses using graph embedding and reinforcement learning
Mohammed et al. Learning Pick to Place Objects using Self-supervised Learning with Minimal Training Resources
Park et al. PLANNING FOR SELECTIVE REMARSHALING IN AN AUTOMATED CONTAINER TERMINAL USING COEVOLUTIONARY ALGORITHMS.
Liu et al. Fuzzy optimization of storage space allocation in a container terminal
CN113988443B (zh) Deep reinforcement learning-based collaborative scheduling method for automated container terminals
Hirashima A reinforcement learning with group-based candidate-extraction for container marshalling at marine ports
Cho et al. Locating algorithm of steel stock area with asynchronous advantage actor-critic reinforcement learning
Hao et al. Motion coordination of multiple robots based on deep reinforcement learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22954773

Country of ref document: EP

Kind code of ref document: A1