CN117114524B - Logistics sorting method based on reinforcement learning and digital twin

Info

Publication number: CN117114524B
Application number: CN202311369261.6A
Authority: CN
Inventors: 黄川�; 崔曙光; 张崴; 李然
Original assignee: Chinese University of Hong Kong Shenzhen
Current assignee: Chinese University of Hong Kong Shenzhen
Priority date: 2023-10-23
Filing date: 2023-10-23
Publication date: 2024-01-26
Anticipated expiration: 2043-10-23
Also published as: CN117114524A

Abstract

The invention discloses a logistics sorting method based on reinforcement learning and digital twins, comprising the following steps: S1, acquiring historical cargo data in a logistics sorting system; S2, collecting historical sorting data of the sorting grids of a sorting machine in the logistics sorting system, and fitting a grid processing efficiency function; S3, integrating package card information through a clustering algorithm to obtain a package card category similarity matrix and a transition probability matrix; S4, designing a reinforcement learning policy network and a value network, and constructing leaf nodes of a Monte Carlo tree; S5, obtaining an optimal grid sorting strategy by expanding the leaf nodes of the Monte Carlo tree; S6, constructing a digital twin for the logistics sorting systems of different logistics transfer sites, and obtaining the optimal grid sorting strategy. The invention separately counts the number of packages at each grid lock and the grid-locking time, and adopts a Monte Carlo tree search reinforcement learning algorithm, which improves the generalization of the sorting plan and makes the method suitable for transfer-site logistics sorting systems with different site conditions.

Description

Logistics sorting method based on reinforcement learning and digital twin

Technical Field

The invention relates to the field of logistics sorting, in particular to a logistics sorting method based on reinforcement learning and digital twin.

Background

Existing logistics sorting methods are generally based on rules derived from manual sorting experience with a sorting machine, and are modeled by analyzing historical packing rules, grid distribution and historical sorting data. Such modeling methods have several drawbacks. First, when efficiency evaluation and predictive analysis are carried out on the conveyor belt and grids of a sorting machine, they are limited by the efficiency factors of manual sorting, so modeling accuracy and prediction accuracy are not high and modeling efficiency is low. Second, existing sorting plan optimization can only target specific historical shifts of a specific site at a time, and the sorting plan is generally adjusted by human experience, so the digital twin model generalizes poorly and is difficult to adapt to the different shift conditions of multiple transfer sites.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and to provide a logistics sorting method based on reinforcement learning and digital twins.

The aim of the invention is achieved by the following technical solution: a logistics sorting method based on reinforcement learning and digital twins, comprising the following steps:

S1, acquiring historical cargo data in a logistics sorting system;

S2, collecting historical sorting data of the sorting grids of a sorting machine in the logistics sorting system, and fitting a grid processing efficiency function;

S3, integrating package card information through a clustering algorithm to obtain a package card category similarity matrix and a transition probability matrix;

S4, designing a reinforcement learning policy network and a value network based on the similarity matrix and transition probability matrix of the package card categories, and constructing leaf nodes of a Monte Carlo tree;

S5, obtaining an optimal grid sorting strategy by expanding the leaf nodes of the Monte Carlo tree;

S6, constructing a digital twin for the logistics sorting systems of different logistics transfer sites, simulating in the digital twin to obtain the historical cargo data and historical sorting data of the logistics transfer site, and dynamically adjusting the Monte Carlo tree according to steps S1 to S5 to obtain the optimal grid sorting strategy of the logistics sorting system at the current logistics transfer site.

The beneficial effects of the invention are as follows: based on the historical sorting data and current sorting data of the logistics transfer site, the number of packages at each grid lock and the grid-locking time are counted separately, and the sorting efficiency function of each grid is fitted from the historical grid sorting information. The method does not require detailed analysis of the distribution of personnel at each site; only the historical sorting information of the grids is collected, so that the uncertain manual sorting efficiency is converted into quantifiable data in units of grids, which improves modeling efficiency. The Monte Carlo tree search reinforcement learning algorithm improves the generalization of the optimized sorting plan, so that the method is suitable for transfer-site logistics sorting systems with different site conditions.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The technical solution of the present invention will be described in further detail with reference to the accompanying drawings, but the scope of the present invention is not limited to the following description.

To address the constraints of different transfer sites and multiple sorting optimization objectives, the invention adopts a logistics sorting method based on reinforcement learning and digital twins. The sorting grid is the minimum unit of sorting and bagging in a logistics sorting machine, the package card is a clustering of goods by flow direction and aging information defined by the transfer site, and the mapping between the two greatly influences overall logistics sorting efficiency. The method collects a large amount of historical sorting data from the sorting grids of the logistics sorting machine, fits a grid processing efficiency function, and models the sorting system directly by combining information such as the conveying speed of the sorting machine's conveyor belt and the package loading speed. In the logistics sorting process, each sorting machine grid has a storage space of fixed capacity; when the storage space reaches its upper limit, the grid is manually locked, bagged and emptied. The grid-locking efficiency of the sorting grids therefore determines the cargo handling efficiency of the logistics sorting system. As shown in FIG. 1, the method comprises the following steps:

S1, acquiring historical cargo data in a logistics sorting system;

The historical cargo data in the logistics sorting system comprises, for each package in the logistics sorting system within a set time: package card number information, package card flow direction information and package card aging (timeliness) information. The package card flow direction information comprises the flow-direction city code of the package; the package card aging information distinguishes air shipments from land shipments.

The historical cargo data may further be preprocessed by preliminarily classifying the packages according to the aging information and flow direction information of each package card:

first, packages are classified into two types according to the aging information: air shipments (fast) and land shipments (slow);

then, within each type, packages are classified by flow direction according to the city codes in the package flow direction information.
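The two-level preliminary classification described above can be sketched as follows. This is only an illustration: the record layout `(package_id, aging, city_code)`, the labels `"air"` and `"land"`, and the function name `classify_packages` are assumptions introduced here, not part of the patent text.

```python
from collections import defaultdict

# Hypothetical record layout: (package_id, aging, city_code).
# "air" / "land" stand for the fast and slow aging classes.
def classify_packages(records):
    """Group packages first by aging class, then by flow-direction city code."""
    groups = defaultdict(lambda: defaultdict(list))
    for package_id, aging, city_code in records:
        if aging not in ("air", "land"):
            raise ValueError(f"unknown aging class: {aging!r}")
        groups[aging][city_code].append(package_id)
    # Convert to plain dicts so the result is easy to inspect.
    return {aging: dict(by_city) for aging, by_city in groups.items()}

# Made-up sample data for illustration only.
records = [
    ("P1", "air", "0755"),
    ("P2", "land", "0755"),
    ("P3", "air", "010"),
    ("P4", "air", "0755"),
]
classified = classify_packages(records)
```

Grouping by aging class first keeps air and land shipments to the same city in separate buckets, which matches the order of the two classification passes in the text.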

The historical sorting data of the sorting grids of the sorting machine in step S2 includes: the number of each sorting grid, the number of grid locks of each sorting grid within the set time T, and, for each grid lock of each sorting grid, the grid-locking duration and the number of packages contained in the grid when it is locked;

Step S2 includes:

S201, based on the historical grid-locking time and grid-locking number data, fitting a grid-locking number function and a grid-locking time function for each grid, where the number of grid locks of each sorting grid within the time T is set to M:

A1, at the moment of the i-th grid lock, counting the number of packages contained in each numbered sorting grid, and fitting a function relating the number of packages to the sorting grid number by using a discount approximation method; in this function, x denotes the sorting grid number, the function value is the number of packages corresponding to the grid numbered x, x = 1, 2, …, X, and X denotes the number of sorting grids; the function represents the fitted relation between the number of packages and the sorting grid number;

counting the grid-locking duration of each numbered sorting grid, and fitting a function relating the grid-locking duration to the sorting grid number by using a recurrence approximation method; in this function, x denotes the sorting grid number, the function value is the grid-locking duration corresponding to the grid numbered x, x = 1, 2, …, X, and X denotes the number of sorting grids; the function represents the fitted relation between the grid-locking duration and the sorting grid number;

A2, for i = 1, 2, …, M, repeatedly executing A1 to obtain the package-number function and the grid-locking duration function of each grid lock, and averaging each of the two sets of M functions to obtain the averaged package-number function and the averaged grid-locking duration function;

S202, calculating the grid processing efficiency function.

The set time T is one historical week.
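A minimal sketch of steps A2 and S202 follows. The source formula for the grid processing efficiency function is not legible, so the ratio used here (averaged packages per lock divided by averaged grid-locking duration) is an assumption, as are the input layout `lock_records` and the function names.

```python
# Hypothetical input: lock_records[i][x] = (package_count, lock_duration_seconds)
# for the i-th lock (i = 0..M-1) of the grid numbered x (x = 0..X-1).
def average_lock_functions(lock_records):
    """Average the per-lock package counts and lock durations over the M locks."""
    m = len(lock_records)
    x_count = len(lock_records[0])
    mean_packages = [sum(lock_records[i][x][0] for i in range(m)) / m
                     for x in range(x_count)]
    mean_duration = [sum(lock_records[i][x][1] for i in range(m)) / m
                     for x in range(x_count)]
    return mean_packages, mean_duration

def processing_efficiency(mean_packages, mean_duration):
    """ASSUMPTION: efficiency = averaged packages per lock / averaged lock duration."""
    return [n / t for n, t in zip(mean_packages, mean_duration)]

# Made-up sample data: M = 2 locks, X = 2 grids.
lock_records = [
    [(40, 200.0), (30, 300.0)],   # first lock of grids 0 and 1
    [(60, 200.0), (30, 300.0)],   # second lock of grids 0 and 1
]
mean_packages, mean_duration = average_lock_functions(lock_records)
efficiency = processing_efficiency(mean_packages, mean_duration)
```

The result is one efficiency value per grid number, in packages per second under the assumed units, which is the form a queuing-based reward can consume as a per-grid service rate.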

Step S3 includes:

S301, using the aging information and flow direction information in the package card information of each package as a feature vector: with the aging feature of the package card denoted z and the flow direction feature denoted w, the feature vector of the package card is (z, w);

setting the number of clusters N equal to the number of grids X, and clustering the package card feature vectors with a clustering algorithm (typically the unsupervised K-means++ algorithm) to obtain N package card categories, recording the cluster center of each package card category, and counting the proportion of the packages of each package card category in the total packages;

defining a feature vector for each package card category, consisting of the aging feature contained in the cluster center of the n-th package card category, the flow direction feature of the cluster center of the n-th package card category, and the proportion of the packages of the n-th package card category in the total packages, n = 1, 2, …, N;

S302, constructing a similarity matrix between the n-th package card category and all package card categories by calculating the Euclidean distance between the feature vector of the n-th package card category and the feature vector of each package card category; the matrix is a 1×N matrix whose k-th column is the Euclidean distance between the feature vectors of the n-th and k-th package card categories, k = 1, 2, …, N; when n = k, the calculated Euclidean distance is 0;

after the similarity matrix is standardized using its modulus (norm), obtaining the state transition matrix between the current package card category and all package card categories, which is also a 1×N matrix;

S303, repeating step S302 for n = 1, 2, …, N to obtain the similarity matrix and state transition matrix corresponding to each package card category.
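The construction in S302 and S303 can be illustrated with a short sketch. The Euclidean distance row follows the text directly; the exact standardization formula is not legible in the source, so dividing each row by its sum is used here purely as a stand-in that yields non-negative entries summing to 1. The feature values and function names are hypothetical.

```python
import math

def similarity_row(features, n):
    """1 x N row of Euclidean distances from category n to every category."""
    return [math.dist(features[n], features[k]) for k in range(len(features))]

def transition_row(sim_row):
    """ASSUMPTION: scale by the row sum so entries are non-negative and sum to 1."""
    total = sum(sim_row)
    if total == 0:
        raise ValueError("all categories coincide; the row cannot be normalized")
    return [d / total for d in sim_row]

# Hypothetical category features: (aging z, flow direction w, package share).
features = [(0.0, 0.0, 0.5), (3.0, 4.0, 0.5), (6.0, 8.0, 0.5)]

# S303: repeat the row construction for every category n.
sim = [similarity_row(features, n) for n in range(len(features))]
trans = [transition_row(row) for row in sim]
```

Each `sim[n]` has a zero in column n, as stated for the case n = k. Whether the patented standardization maps larger distances to larger or smaller transition probabilities cannot be determined from the text, so the direction chosen here should not be read as the authors' choice.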

S401, designing a reinforcement learning policy network according to the similarity matrix and state transition matrix of the package card categories:

designing a state-action pair (n, a), where the state n represents the package card category n corresponding to the current logistics sorting grid, and the action a represents the package card category selected for placement at the next logistics sorting grid number;

for the current state n, the policy network gives the estimated capacity value of the current logistics sorting grid arrangement and the transition probability of selecting the package card category to be placed at the next logistics sorting grid number; both are expressed through the k-th column elements of the matrices obtained in step S3;

S402, designing an estimated capacity value network for the logistics sorting grids: for the state-action pair (n, a), with the current logistics sorting grid corresponding to package card category n, the estimated capacity value of the value network is calculated as follows:

the value network first traverses all values of k, evaluates a term at each value of k, and sums the results over all k; the sum is recorded as the estimated capacity value of the value network;

S403, constructing leaf nodes of the Monte Carlo tree:

each leaf node corresponds to one sorting grid, and each sorting grid holds packages of only one package card category; each leaf node stores the following quantities: P(n, a), the prior transition probability, given by the policy network; the initial evaluation value before the current leaf node of the Monte Carlo tree is expanded; the reward function estimated after the current leaf node is expanded; the number of visits before the current leaf node is expanded; the number of selections after the current leaf node is expanded; and Q(n, a), the weighted value average of the current node, computed with a preset weight;

when the maximum estimated capacity value is estimated for each Monte Carlo tree node, simulated logistics sorting grid allocations are carried out a fixed number of times, or within a fixed time limit, under the current package card category state n, so that the policy network's estimated capacity value of the logistics sorting grids under package card category state n approximates the value network's estimated capacity value of the logistics sorting grids under package card category state n.

In this embodiment, in step S5, the nodes of the Monte Carlo tree are expanded through the procedures of selection, expansion, evaluation and backtracking.

The selection procedure is as follows:

given the position of the root node, the child node information is searched, and the current optimal child node is selected according to the polynomial confidence selection rule, in which a constant determines the degree of exploration when selecting child nodes; nodes with higher selection probability are chosen at the start of the selection process, and nodes with higher value are gradually chosen as the Monte Carlo tree is expanded.
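The selection rule itself is not legible in the source. The sketch below uses the widely known PUCT form (value average plus a prior-weighted exploration bonus) as an assumption; it reproduces the behaviour described above, where high-prior nodes win early and high-value nodes win as visits accumulate. The class `Edge`, the function `select_action` and the constant `c_puct` are names introduced here for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    prior: float          # P(n, a): prior transition probability
    visits: int = 0       # number of times the edge has been selected
    q_value: float = 0.0  # Q(n, a): running value average

def select_action(edges, c_puct=1.0):
    """Return the action maximizing Q + c * P * sqrt(1 + total visits) / (1 + visits).

    ASSUMPTION: this is a common PUCT form; the source formula is not shown.
    """
    total_visits = sum(e.visits for e in edges.values())

    def score(edge):
        bonus = c_puct * edge.prior * math.sqrt(1 + total_visits) / (1 + edge.visits)
        return edge.q_value + bonus

    return max(edges, key=lambda a: score(edges[a]))

# Before any visit, the prior decides.
fresh = {"A": Edge(prior=0.7), "B": Edge(prior=0.3)}
first_choice = select_action(fresh)

# After many visits, the value average dominates.
visited = {"A": Edge(prior=0.7, visits=50, q_value=0.1),
           "B": Edge(prior=0.3, visits=1, q_value=0.9)}
later_choice = select_action(visited)
```

A larger `c_puct` keeps the search exploring low-visit actions for longer; a smaller one commits to high-value actions sooner.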

The expansion procedure is as follows:

during expansion of the Monte Carlo tree, the policy network of S4 gives the prior transition probability P(n, a) according to the package card category transition matrix, and expansion is carried out until the next leaf node of the Monte Carlo tree or until the Monte Carlo tree is fully expanded.

The evaluation procedure is as follows:

starting from a leaf node, the Monte Carlo tree is expanded until it is fully expanded, and the reward value of the current branch is then obtained from the package-to-grid allocation of that branch and the reward function;

the reward function is constructed on the basis of queuing theory: the average speed v of the conveyor belt is obtained from the physical data of the hardware of the transfer-site sorting system, and, from the proportion of each package card category in the total packages, with these proportions treated as independent and identically distributed, the arrival rate at the grid corresponding to each package card category is obtained;

based on the grid processing efficiency function obtained in S2 from the historical grid-locking time and grid-locking number data, and on the grid x occupied by each package card category under the allocation, the queue length of each package card category in logistics sorting is obtained;

constructing the reward function in this way greatly accelerates the evaluation of each branch of the Monte Carlo tree, greatly reduces the number of simulation calls to the transfer-site logistics sorting digital twin system, and improves the speed of sorting optimization based on historical data.
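A rough sketch of a queuing-based reward follows. The arrival-rate and queue-length formulas are not legible in the source, so three assumptions are made and labeled in the code: the arrival rate of a category is the belt package rate times its package share, the queue length uses the simple M/M/1 mean as a stand-in for the patent's queuing model, and the reward is the negative total queue length. All function and parameter names are hypothetical.

```python
def arrival_rates(belt_rate, shares):
    """ASSUMPTION: arrival rate of a category = belt package rate * its package share."""
    return [belt_rate * p for p in shares]

def queue_length(arrival, service):
    """Mean number in an M/M/1 queue, used here only as a simple stand-in."""
    if arrival >= service:
        return float("inf")   # overloaded grid: the queue grows without bound
    rho = arrival / service
    return rho / (1.0 - rho)

def reward(belt_rate, shares, assignment, efficiency):
    """ASSUMPTION: reward = negative total queue length for a category -> grid assignment."""
    rates = arrival_rates(belt_rate, shares)
    return -sum(queue_length(rates[n], efficiency[assignment[n]])
                for n in range(len(shares)))

# Made-up numbers: two categories, two grids with different processing efficiency.
shares = [0.5, 0.25]
efficiency = [1.0, 0.5]          # packages per second each grid can process
good = reward(1.0, shares, assignment=[0, 1], efficiency=efficiency)
bad = reward(1.0, shares, assignment=[1, 0], efficiency=efficiency)
```

Because the reward is a closed-form expression, a branch can be scored without running a full simulation, which is the speed-up the text attributes to the queuing-based construction.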

The backtracking procedure is as follows:

after the Monte Carlo tree is fully expanded, the selection counts and the corresponding leaf node evaluation values and reward values under the expansion strategy are updated upward from the lowest node of the Monte Carlo tree to the current leaf node, and the weighted value average Q(n, a) of the current leaf node is calculated through the backtracking procedure.

After multiple rounds of expansion, exploration and interaction of the Monte Carlo tree, when the change in the accumulated reward value of all current nodes falls below a preset threshold, the expansion path of the Monte Carlo tree from the current root node is the optimal path, that is, the optimal grid sorting strategy.

Because each leaf node corresponds to one sorting grid, the optimal path gives the optimal grid sorting strategy.
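The backtracking update and the stopping test can be sketched as below. The weighted average formula is not legible in the source; mixing the mean value-network estimate and the mean branch reward with a preset weight is an assumption modeled on common Monte Carlo tree search practice. `NodeStats`, `backup`, `converged` and the default `weight` and `threshold` values are illustrative names and numbers only.

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    visits: int = 0
    value_sum: float = 0.0   # accumulated value-network estimates
    reward_sum: float = 0.0  # accumulated branch rewards

    def q_value(self, weight=0.5):
        """ASSUMPTION: Q = (1 - weight) * mean value + weight * mean reward."""
        if self.visits == 0:
            return 0.0
        mixed = (1.0 - weight) * self.value_sum + weight * self.reward_sum
        return mixed / self.visits

def backup(path, value_estimate, branch_reward):
    """Update every node on the path, from the deepest node up to the root."""
    for stats in reversed(path):
        stats.visits += 1
        stats.value_sum += value_estimate
        stats.reward_sum += branch_reward

def converged(previous_total, current_total, threshold=1e-3):
    """Stop when the accumulated reward changes by less than the threshold."""
    return abs(current_total - previous_total) < threshold

# Two simulated rounds on a root -> leaf path (made-up numbers).
root, leaf = NodeStats(), NodeStats()
backup([root, leaf], value_estimate=0.4, branch_reward=0.8)
backup([root], value_estimate=0.0, branch_reward=0.2)
```

Storing sums and a visit count, rather than the average itself, lets the preset weight be changed without replaying earlier rounds.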

The logistics sorting digital twin of a logistics transfer site is constructed from historical data (such as the grid-locking durations of the logistics sorting grids and the number of packages at each grid lock) fused in series with a common operating mechanism model of the transfer-site sorting machine, which comprises a conveyor belt model, a sorting grid model, a package loading shelf model, a package scanning model, and so on. Because site conditions differ between logistics transfer sites, variable parameters are added to the common operating mechanism model when constructing the digital twin for a given transfer site, such as: the number of grids, the number of package loading shelves (which affects the package loading rate), and the fixed speed of the conveyor belt (which affects the rate at which packages on the belt fall into the sorting grids). A grid sorting strategy obtained by Monte Carlo tree search at one fixed site therefore does not generalize directly. The reward function of the Monte Carlo tree search algorithm is derived from the M/G/n theoretical model of queuing theory, whose fixed parameters change with the differences between logistics transfer sites.

Simulation is performed by applying test indicators of actual logistics sorting to the logistics sorting digital twin, for example peak capacity, the proportion of packages that fall into a grid within half a circuit, and average capacity. These indicators are used to dynamically adjust the corresponding parameters of the Monte Carlo tree search algorithm, such as: the total number n of package card categories, which corresponds to the number of grids; the package loading rate, which corresponds to the number of package loading shelves; and the package arrival rate of each logistics sorting grid, which corresponds to the fixed conveyor belt speed. The corresponding optimal sorting strategy is thereby obtained.
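The site-specific variable parameters listed above can be collected in a small configuration object. The sketch below is only one way to organize them: the class `TwinParameters`, the function `search_settings` and the constant `rate_per_shelf` are hypothetical, and the linear relation between shelf count and loading rate is an assumption made for illustration.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TwinParameters:
    grid_count: int           # number of sorting grids (= number of package card categories)
    loading_shelf_count: int  # affects the package loading rate
    belt_speed: float         # affects the package arrival rate at each grid

def search_settings(params, rate_per_shelf=0.5):
    """Map site parameters to search inputs. rate_per_shelf is a made-up constant."""
    return {
        "category_count": params.grid_count,
        "loading_rate": params.loading_shelf_count * rate_per_shelf,
        "belt_speed": params.belt_speed,
    }

# Two hypothetical sites sharing one mechanism model but differing in parameters.
site_a = TwinParameters(grid_count=120, loading_shelf_count=8, belt_speed=2.0)
site_b = replace(site_a, grid_count=200, loading_shelf_count=12)
settings_b = search_settings(site_b)
```

Keeping the parameters immutable and deriving each site from a shared baseline mirrors the idea of one common mechanism model with per-site variable parameters.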

The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A logistics sorting method based on reinforcement learning and digital twins, characterized in that the method comprises the following steps:

S1, acquiring historical cargo data in a logistics sorting system;

step S4 includes:

S401, designing a reinforcement learning policy network according to the similarity matrix and state transition matrix of the package card categories;

S402, designing an estimated capacity value network for the logistics sorting grids: for the state-action pair (n, a), with the current logistics sorting grid corresponding to package card category n, calculating the estimated capacity value of the value network;

S403, constructing leaf nodes of the Monte Carlo tree:

each leaf node corresponds to one sorting grid, and each sorting grid holds packages of only one package card category; each leaf node stores the following quantities: P(n, a), the prior transition probability, given by the policy network; the initial evaluation value before the current leaf node of the Monte Carlo tree is expanded; the reward function estimated after the current leaf node is expanded; the number of visits before the current leaf node is expanded; the number of selections after the current leaf node is expanded; and the weighted value average of the current node, computed with a preset weight;

when the maximum estimated capacity value is estimated for each Monte Carlo tree node, simulated logistics sorting grid allocations are carried out a fixed number of times, or within a fixed time limit, under the current package card category state n, so that the policy network's estimated capacity value of the logistics sorting grids under package card category state n approximates the value network's estimated capacity value of the logistics sorting grids under package card category state n;

2. The logistics sorting method based on reinforcement learning and digital twins according to claim 1, characterized in that the historical cargo data in the logistics sorting system in step S1 comprises, for each package in the logistics sorting system within a set time T: package card number information, package card flow direction information and package card aging information; the package card flow direction information comprises the flow-direction city code of the package; the package card aging information distinguishes air shipments from land shipments.

3. The logistics sorting method based on reinforcement learning and digital twins according to claim 1, characterized in that the historical sorting data of the sorting grids of the sorting machine in step S2 includes: the number of each sorting grid, the number of grid locks of each sorting grid within the set time T, and, for each grid lock of each sorting grid, the grid-locking duration and the number of packages contained in the grid when it is locked.

4. The logistics sorting method based on reinforcement learning and digital twins according to claim 1, characterized in that step S2 includes:

S201, based on the historical grid-locking time and grid-locking number data, fitting a grid-locking number function and a grid-locking time function for each grid, where the number of grid locks of each sorting grid within the time T is set to M:

A1, at the moment of the i-th grid lock, counting the number of packages contained in each numbered sorting grid, and fitting a function relating the number of packages to the sorting grid number by using a discount approximation method; in this function, x denotes the sorting grid number, the function value is the number of packages corresponding to the grid numbered x, x = 1, 2, …, X, and X denotes the number of sorting grids; the function represents the fitted relation between the number of packages and the sorting grid number;

counting the grid-locking duration of each numbered sorting grid, and fitting a function relating the grid-locking duration to the sorting grid number by using a recurrence approximation method; in this function, x denotes the sorting grid number, the function value is the grid-locking duration corresponding to the grid numbered x, x = 1, 2, …, X, and X denotes the number of sorting grids; the function represents the fitted relation between the grid-locking duration and the sorting grid number;

S202, calculating the grid processing efficiency function.

5. The logistics sorting method based on reinforcement learning and digital twins according to claim 2 or 4, characterized in that the set time T is one historical week.

6. The logistics sorting method based on reinforcement learning and digital twins according to claim 1, characterized in that step S3 includes:

setting the number of clusters N equal to the number of grids X, and clustering the package card feature vectors with a clustering algorithm to obtain N package card categories, recording the cluster center of each package card category, and counting the proportion of the packages of each package card category in the total packages;

S302, constructing a similarity matrix between the n-th package card category and all package card categories by calculating the Euclidean distance between the feature vector of the n-th package card category and the feature vector of each package card category; the matrix is a 1×N matrix whose k-th column is the Euclidean distance between the feature vectors of the n-th and k-th package card categories, k = 1, 2, …, N; when n = k, the calculated Euclidean distance is 0;

after the similarity matrix is standardized using its modulus (norm), obtaining the state transition matrix between the current package card category and all package card categories, which is also a 1×N matrix;

S303, repeating step S302 for n = 1, 2, …, N to obtain the similarity matrix and state transition matrix corresponding to each package card category.