CN114186712A - Container loading and unloading intelligent method and system based on reinforcement learning - Google Patents


Info

Publication number
CN114186712A
CN114186712A (application CN202111284086.1A)
Authority
CN
China
Prior art keywords
agent
reinforcement learning
intelligent
unloading
intelligent agent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111284086.1A
Other languages
Chinese (zh)
Inventor
孔雨昕
陈志勇
史玉良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202111284086.1A priority Critical patent/CN114186712A/en
Publication of CN114186712A publication Critical patent/CN114186712A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q 10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063 Operations research, analysis or management
    • G06Q 10/0637 Strategic management or analysis, e.g. setting a goal or target of an organisation; Planning actions based on goals; Analysis or evaluation of effectiveness of goals
    • G06Q 10/08 Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q 10/083 Shipping


Abstract

The invention provides a reinforcement-learning-based intelligent container loading and unloading method, which comprises the following steps: acquiring container size data and a loading and unloading plan; acquiring the initial states, parameters and assigned tasks of a plurality of agents; and obtaining a decision result with an agent reinforcement learning model according to the assigned tasks. The agents comprise a yard agent, a handling-equipment agent and a berth agent. Obtaining the decision result with the agent reinforcement learning model comprises dividing the tasks and the agents, establishing the agent reinforcement learning model, and setting the state space, action space and reward value of each agent in the environment. The method establishes a multi-agent reinforcement learning model and automatically generates the container loading and unloading plan through the continuous iterative learning of the multi-agent system, replacing the old mode in which service personnel manually draw up the loading and unloading plan; it realizes an intelligent, multi-agent reinforcement-learning-based container handling workflow and lays a foundation for the construction of intelligent ports.

Description

Container loading and unloading intelligent method and system based on reinforcement learning
Technical Field
The invention relates to the technical field of intelligent container loading and unloading, and in particular to an intelligent container loading and unloading method and system based on reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
The early container handling problem was mainly solved by relying on the experience of port staff, who had to comprehensively consider berth idle states, handling-equipment specifications and idle states, yard idle states, berth specifications and similar factors, and draw up the related plans. However, with the growth of port traffic and the expansion of cargo-handling scale, the manual formulation of handling and yard plans by port staff alone can hardly meet production requirements.
Existing research methods have not achieved fully automatic planning and scheduling for yard allocation, berth allocation and mechanical-facility scheduling, nor do they closely combine quay-crane scheduling optimization, yard scheduling optimization and container-handling scheduling optimization. Meanwhile, reinforcement learning, although a popular field, has not actually been applied to production practice in the port industry. Therefore, an intelligent container loading and unloading method and system based on reinforcement learning is needed.
Disclosure of Invention
The invention provides a container loading and unloading intelligent method and system based on reinforcement learning, aiming at solving the problems.
According to some embodiments, the invention adopts the following technical scheme:
an intelligent container loading and unloading method based on reinforcement learning comprises the following steps:
acquiring container size data and a loading and unloading plan;
acquiring initial states, parameters and allocation tasks of a plurality of agents;
according to the allocation task, a decision result is obtained by using an agent reinforcement learning model;
The agents comprise a yard agent, a handling-equipment agent and a berth agent. Obtaining the decision result with the agent reinforcement learning model comprises: dividing the tasks and the agents, establishing the agent reinforcement learning model, and setting the state space, action space and reward value of each agent in the environment.
Further, the initial state of an agent includes the usage and berth specification of the berth agent, the usage, specification and type of the handling-equipment agent, and the container-presence status of the yard agent.
Further, the agent obtains decision actions according to the assigned tasks and its own state, and after K steps processes the stored decision actions and state features.
Further, after the K-th decision action the agent uploads the formed state features, decision actions and reward values to an experience pool to form experiences.
Further, after uploading to the experience pool, the parameters of the agents are aggregated, and the aggregation result is returned to the agents for parameter updating.
Further, after the parameters are updated, all experiences are sampled according to a sampling probability, and each agent iteratively trains and optimizes the agent reinforcement learning model according to the sampled experiences until all tasks are completed.
Further, aggregating the parameters of the agents comprises applying the update formula with the agents' parameters and the value function, generating the TD-error.
An intelligent container handling system based on reinforcement learning, comprising:
the data acquisition module is configured to acquire the initial state, parameters and allocation tasks of the intelligent agent;
the reinforcement learning module is configured to obtain a decision result by using the agent reinforcement learning model according to the assigned tasks;
and the agent obtains a decision action through the reinforcement learning model according to the assigned target, and temporarily stores the decision action and the corresponding state features.
A computer readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device and to execute a reinforcement learning based container handling intelligence method.
A terminal device comprising a processor and a computer readable storage medium, the processor being configured to implement instructions; the computer readable storage medium stores instructions adapted to be loaded by a processor and to perform the reinforcement learning based container handling intelligence method.
Compared with the prior art, the invention has the beneficial effects that:
the invention starts from the container loading and unloading operation process, deeply analyzes the linkage relation among the yard, the loading and unloading equipment and the berth, enables the intelligent realization of the container loading and unloading process to be possible, overcomes the defect that the existing container loading and unloading plan needs the manual formulation of wharf service personnel, generates the container loading and unloading plan and the yard plan as accurately as possible, and reduces the manual intervention to the maximum extent; based on the thought of multi-agent reinforcement learning, a reinforcement learning model is established, a container loading and unloading plan is automatically generated through continuous iterative learning of a multi-agent system, the old mode that business personnel manually make the loading and unloading plan is replaced, container loading and unloading flow intelligentization based on the multi-agent reinforcement learning is realized, and a foundation is laid for the construction of an intelligent port.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a schematic flow chart of an intelligent container loading and unloading method based on multi-agent reinforcement learning according to the present invention;
FIG. 2 is a schematic diagram of interaction between agent i and the environment in the multi-agent reinforcement learning model of the present invention;
FIG. 3 is a schematic diagram of an algorithm flow of an intelligent container loading and unloading method based on multi-agent reinforcement learning according to the present invention;
FIG. 4 is a diagram illustrating reward trends for each agent in the multi-agent reinforcement learning model of the present invention;
FIG. 5 is a system modeling structure diagram of an intelligent container handling method based on multi-agent reinforcement learning according to the present invention;
FIG. 6 is a schematic view showing the analysis and comparison of the loading and unloading costs in the container loading and unloading process according to the present invention;
fig. 7 is a schematic view showing the analysis and comparison of the loading and unloading time in the container loading and unloading process of the present invention.
Detailed Description
the invention is further described with reference to the following figures and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example 1.
As shown in fig. 1, an intelligent container handling method based on reinforcement learning includes:
acquiring container size data and a loading and unloading plan;
acquiring initial states, parameters and allocation tasks of a plurality of agents;
according to the allocation task, a decision result is obtained by using an agent reinforcement learning model;
The agents comprise a yard agent, a handling-equipment agent and a berth agent. Obtaining the decision result with the agent reinforcement learning model comprises: dividing the tasks and the agents, establishing the agent reinforcement learning model, and setting the state space, action space and reward value of each agent in the environment.
The method specifically comprises the following steps:
1) dividing tasks and intelligent agents, establishing a reinforcement learning model for each intelligent agent, and setting state space, action space and reward value of the intelligent agent in the environment;
First, the container handling tasks are divided: the container handling plan is decomposed into a number of handling subtasks according to container type, and the subtasks are executed in sequence.
In this example, the yard is divided into regions according to the order of the berths; each region is divided into blocks in order from the sea side to the land side; and each block is further divided into bays, corresponding to the bays (BAY) of the ship's stowage plan. Each bay is the size of one 20GP container and is numbered sequentially; for example, a 40GP container occupies two adjacent bays.
The stacking rules are as follows: containers of the same type may be stored in more than one bay, but containers of different types are stored in different bays; each bay may only store containers of a single type.
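The bay sizing and stacking rules above can be illustrated with a small allocation check. This is a hedged sketch: the data structure and helper names are illustrative assumptions, not taken from the patent.

```python
# Illustrative sketch of the bay rules described above: a 20GP container
# occupies one bay, a 40GP container occupies two adjacent bays, and each
# bay may only hold containers of a single type.

def bays_needed(container_size: str) -> int:
    """A 20GP container occupies one bay; a 40GP occupies two adjacent bays."""
    return 2 if container_size == "40GP" else 1

def can_place(yard: dict, bay: int, container_size: str, container_type: str) -> bool:
    """Check whether a container may be placed starting at `bay`.

    `yard` maps bay number -> container type stored there (None if empty).
    """
    for b in range(bay, bay + bays_needed(container_size)):
        stored = yard.get(b)
        if stored is not None and stored != container_type:
            return False  # bay already holds a different container type
    return True

yard = {1: "reefer", 2: None, 3: None}
assert can_place(yard, 2, "40GP", "dry")      # bays 2-3 are free
assert not can_place(yard, 1, "20GP", "dry")  # bay 1 holds reefers
```

A real yard agent would also track stacking height and retrieval order; the sketch only captures the one-type-per-bay and adjacency constraints stated above.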
Second, the agents are partitioned. There are 3 agents in the container loading and unloading operation, namely the yard agent, the handling-equipment agent and the berth agent, numbered 1, 2 and 3. With state space S, action space A and reward value R, a triple

(s_i^{e,t,n}, a_i^{e,t,n}, r_i^{e,t,n})

is generated, where s_i^{e,t,n}, a_i^{e,t,n} and r_i^{e,t,n} respectively denote the state, action and reward of agent i (1 ≤ i ≤ 3) after container n triggers a loading/unloading event at time t of the e-th training round.
State space: the specific features observed by each agent at the current time. In this example, the state space of the berth agent includes the position, usage and specification of the berth; the state space of the handling-equipment agent includes the specification, type, position and usage of the handling equipment; and the state space of the yard agent includes the bay layout and the container-presence status. In this example,

s_i^{e,t,n} = (p_i, p_j, u_i^{e,t}),

where p_i and p_j respectively represent basic information such as the specification and position of agent i, and u_i^{e,t} indicates the usage of agent i at time t of the e-th round.
Action space: an action space is established for each agent to store the actions that the agent may generate. In this example, a_i^{e,t,n} denotes the action decision of the agent after container n triggers a loading/unloading start event at time t of the e-th round. The actions actually required include whether to execute the loading/unloading of the current container, which specific handling equipment executes the action, and the subsequent arrangement of yard bays and berths.
Reward value: the setting of the reward value R of the multi-agent model for the intelligent container handling workflow mainly considers three aspects: first, the individual real-time reward r_1, the reward fed back in real time by the current environment to the current agent; second, the global real-time reward r_2, the reward the current environment gives to all agents; and third, the final system reward r_3, i.e. the global reward the multi-agent system gives to all agents after completing all tasks. Thus, the reward value of an agent is

r = r_1 + r_2 + r_3.
Moreover, when an agent selects an action and obtains a real-time reward, the influence of the current action on future rewards and penalties must also be considered, so the reward value of the agent at time t is expressed as

R_t = Σ_{k=0}^{∞} γ^k · r_{t+k},

where γ is a discount factor: the larger γ is, the greater the influence of future rewards on the current decision; the smaller γ is, the less influence future rewards have.
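The discounted reward value above is the standard discounted return. A minimal sketch (the finite reward list and function name are assumptions for illustration):

```python
def discounted_return(rewards, gamma):
    """Sum of rewards discounted by gamma:
    R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# With gamma near 1 future rewards weigh heavily on the current decision;
# with gamma near 0 the agent is effectively myopic.
assert discounted_return([1.0, 1.0, 1.0], 0.5) == 1.0 + 0.5 + 0.25
```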
In this example, let the current state be s^t, the next state be s^{t+1}, and the expected (target) state be s*. If s^{t+1} is closer to s* than s^t, then r_1 is positive, indicating that the agent is approaching the target; if s^{t+1} is farther from s* than s^t, then r_1 is negative, indicating that the agent is moving away from the target. In particular, r_1 and r_2 always satisfy r_1 ≥ r_2.
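The sign rule for the individual reward r_1 reads as a distance comparison to the target state. The sketch below uses Euclidean distance and sets the magnitude to the change in distance; both are assumptions, since the patent only fixes the sign of r_1:

```python
import math

def individual_reward(s_t, s_next, s_target):
    """r_1 > 0 if the new state is closer to the target than the old one,
    r_1 < 0 if it is farther. The Euclidean metric and the magnitude
    (change in distance) are illustrative assumptions."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return dist(s_t, s_target) - dist(s_next, s_target)

assert individual_reward((0.0,), (1.0,), (2.0,)) > 0  # moved toward target
assert individual_reward((1.0,), (0.0,), (2.0,)) < 0  # moved away from target
```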
2) The initial state of each agent is acquired and used as the initial state in the reinforcement learning model.
The current state of an agent comprises the usage and berth specification of the berth agent, the usage, specification and type of the handling-equipment agent, and the container-presence status of the yard agent. In this example, the initial state of agent i is obtained and recorded as s_i^{1,t_0,n}, denoting the state of each agent i (1 ≤ i ≤ 3) at time t_0 of the 1st training round after container n triggers the loading/unloading event.
3) The agent decides the next action according to the assigned target and its own current state, interacts with the environment to obtain the next decision, and temporarily stores the generated state features and decision actions. The specific steps are S3-1 to S3-4.
Step S3-1: agent i observes the environment and obtains the current state value s_i^t.
Step S3-2: agent i executes action a_i^t, observes the environment to obtain the state s_i^{t+1}, and then sends the new state s_i^{t+1} to agent i+1.
Step S3-3: agent i obtains the real-time feedback reward value r_1 given by the external environment.
Step S3-4: the state values s = (s_1, ..., s_n), next-step state values s' = (s'_1, ..., s'_n), action values a = (a_1, ..., a_n) and real-time feedback reward values r = (r_1, r_2, ..., r_n) of all agents from steps S3-1 to S3-3 are saved, and the experience is temporarily stored.
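Steps S3-1 to S3-4 describe a sequential interaction loop in which each agent passes the observed new state to the next agent and the joint transition is buffered. A minimal sketch follows; the toy environment, the always-act-one policy, and all names are placeholders, since the patent does not spell out the port environment's interface:

```python
class ToyEnv:
    """Placeholder environment: the state is a single number and each
    action adds to it. Stands in for the real port environment."""
    def __init__(self):
        self.state = 0

    def observe(self):
        return self.state

    def step(self, action):
        self.state += action
        r1 = 1.0 if action > 0 else -1.0  # toy real-time reward (S3-3)
        return self.state, r1


class ToyAgent:
    """Placeholder policy: always acts +1."""
    def act(self, state):
        return 1


def rollout_step(env, agents):
    """One pass of steps S3-1..S3-4: each agent observes, acts, hands the
    new state to the next agent, and the joint transition is returned
    for temporary storage."""
    states, actions, next_states, rewards = [], [], [], []
    state = env.observe()                  # S3-1: current state value
    for agent in agents:
        action = agent.act(state)          # decide from the current state
        next_state, r1 = env.step(action)  # S3-2/S3-3: act, observe, get r1
        states.append(state)
        actions.append(action)
        next_states.append(next_state)
        rewards.append(r1)
        state = next_state                 # pass the new state to agent i+1
    return (states, actions, next_states, rewards)  # S3-4: buffer this tuple


experience = rollout_step(ToyEnv(), [ToyAgent(), ToyAgent(), ToyAgent()])
```

With three agents and this toy setup, the buffered tuple contains states [0, 1, 2], next states [1, 2, 3], and three positive real-time rewards, mirroring the state hand-off from agent i to agent i+1.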
4) After a period of time (K steps), the reward values of the experiences stored in the temporary buffer are calculated, forming state features, decision actions and reward values that are uploaded to the experience pool for the agents' subsequent iterative optimization training.
5) The parameters uploaded by all agents are aggregated, and the result is returned to each agent for parameter updating. In this example, each agent uploads its local parameters to the multi-agent system, which then generates a TD-error according to the value-function update formula. Experiences with a large TD-error are beneficial to the agents' learning and should be sampled more often; meanwhile, since the agents are continuously being iteratively optimized, experiences generated earlier are less useful for later iterative training, so the sampling probability of newly generated experiences should be increased.
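The TD-error-weighted, recency-biased sampling described above resembles prioritized experience replay with a recency bonus. The patent does not give the priority formula, so the exponent, the recency weighting, and all names below are assumptions:

```python
def sampling_probabilities(td_errors, ages, alpha=0.6, recency_weight=0.5):
    """Experiences with a larger |TD-error| are more useful for learning
    and get a higher probability; newer experiences (smaller age) also
    get a boost, since older experiences hinder later iterations.
    `alpha` and `recency_weight` are illustrative assumptions."""
    priorities = [
        (abs(td) ** alpha) + recency_weight / (1 + age)
        for td, age in zip(td_errors, ages)
    ]
    total = sum(priorities)
    return [p / total for p in priorities]

# Three experiences: the second has the largest TD-error and is newest,
# so it receives the highest sampling probability.
probs = sampling_probabilities(td_errors=[0.1, 2.0, 0.5], ages=[10, 0, 5])
assert abs(sum(probs) - 1.0) < 1e-9
assert probs[1] == max(probs)
```

These probabilities could then be fed to a weighted sampler (e.g. `random.choices` with `weights=probs`) to draw the training batch in step 6).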
6) All experiences are sampled according to the sampling probability, and each agent iteratively trains and optimizes its model according to the sampled experiences.
FIG. 4 is the reward trend graph obtained by each agent during iterative training; green, yellow and blue show the performance of agent 1, agent 2 and agent 3 respectively. Initially, the rewards obtained by the agents are low and unstable because the model parameters vary randomly. With continuous iterative optimization, each agent performs better and better, and the rewards obtained increase and gradually stabilize.
7) Steps 3) to 6) are repeated until all tasks are completed. The system modelling structure is shown in fig. 5.
This embodiment relies on a comparison experiment carried out at a certain port. Figs. 6 and 7 respectively compare container handling cost and handling time under the multi-agent reinforcement learning model, showing that the method not only avoids the manpower consumed by manual planning in practical application, but also reduces container handling cost to a certain extent. Because the system continuously performs iterative learning, the multi-agent reinforcement-learning-based container handling method can always find the best handling mode in actual operation: shortening container stacking time and optimizing transport paths reduces container-moving, transport and re-handling costs, while the shortened handling time correspondingly reduces storage costs.
Example 2.
An intelligent container handling system based on reinforcement learning, comprising:
the data acquisition module is configured to acquire the initial state, parameters and allocation tasks of the intelligent agent;
the reinforcement learning module is configured to obtain a decision result by using the agent reinforcement learning model according to the assigned tasks;
and the agent obtains a decision action through the reinforcement learning model according to the assigned target, and temporarily stores the decision action and the corresponding state features.
Example 3.
A computer readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device and to execute a reinforcement learning based container handling intelligence method.
Example 4.
A terminal device comprising a processor and a computer readable storage medium, the processor being configured to implement instructions; the computer readable storage medium stores instructions adapted to be loaded by a processor and to perform the reinforcement learning based container handling intelligence method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.

Claims (10)

1. A container loading and unloading intelligent method based on reinforcement learning is characterized by comprising the following steps:
acquiring container size data and a loading and unloading plan;
acquiring initial states, parameters and allocation tasks of a plurality of agents;
according to the allocation task, a decision result is obtained by using an agent reinforcement learning model;
the intelligent agent comprises a storage yard intelligent agent, a loading and unloading equipment intelligent agent and a berthing intelligent agent, the decision result is obtained by utilizing an intelligent agent reinforcement learning model, the allocation task and the intelligent agent are divided, the intelligent agent reinforcement learning model is established, and the state space, the action space and the reward value of each intelligent agent in the environment are set.
2. The reinforcement-learning-based intelligent container handling method according to claim 1, wherein the initial state of each agent comprises the occupancy state and berth specification of the berthing agent, the occupancy state, specification, and type of the handling-equipment agent, and the container occupancy of the storage-yard agent.
3. The reinforcement-learning-based intelligent container handling method according to claim 2, wherein each agent derives decision actions from its assigned tasks and its own state and, after K computation steps, temporarily stores the decision actions together with the corresponding state features.
4. The reinforcement-learning-based intelligent container handling method according to claim 3, wherein after K decision actions each agent uploads the stored state features, decision actions, and reward values to an experience pool to form experiences.
5. The reinforcement-learning-based intelligent container handling method according to claim 4, wherein after the upload to the experience pool the parameters of the agents are aggregated, and the aggregated result is returned to each agent for a parameter update.
6. The reinforcement-learning-based intelligent container handling method according to claim 5, wherein after the parameter update all experiences are sampled according to their sampling probabilities, and each agent iteratively trains and optimizes the agent reinforcement learning model on the sampled experiences until all tasks are completed.
7. The reinforcement-learning-based intelligent container handling method according to claim 6, wherein aggregating the parameters of the agents comprises applying a parameter update formula together with a value function to generate the TD-error.
8. A reinforcement-learning-based intelligent container handling system, comprising:
a data acquisition module configured to acquire the initial states, parameters, and allocated tasks of the agents; and
a reinforcement learning module configured to obtain a decision result from the allocated tasks using an agent reinforcement learning model,
wherein each agent derives a decision action from its allocated target through the reinforcement learning model and temporarily stores the decision action together with the corresponding state features.
9. A computer-readable storage medium, characterized in that it stores instructions adapted to be loaded by a processor of a terminal device to execute the reinforcement-learning-based intelligent container handling method according to any one of claims 1-7.
10. A terminal device, characterized in that it comprises a processor and a computer-readable storage medium, the processor being configured to implement the instructions and the computer-readable storage medium storing a plurality of instructions adapted to be loaded by the processor to execute the reinforcement-learning-based intelligent container handling method according to any one of claims 1-7.
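The three agent types named in claim 1, each with its own state, action space, and per-task decision step, can be illustrated with a minimal sketch. All class names, state fields, and action labels below are hypothetical assumptions for illustration only; the patent does not publish concrete state or action definitions, and a trained policy network would replace the random action choice.

```python
from dataclasses import dataclass, field
import random

@dataclass
class Agent:
    # One agent in the multi-agent terminal environment (claim 1).
    name: str
    state: dict = field(default_factory=dict)
    actions: list = field(default_factory=list)

    def decide(self, task):
        # Derive a decision action from the assigned task and own state
        # (claim 3); a learned policy would replace this random choice.
        return random.choice(self.actions)

# Initial states loosely follow claim 2: berth occupancy and specification,
# equipment occupancy/specification/type, and yard container occupancy.
berth_agent = Agent("berth", {"in_use": False, "length_m": 300}, ["assign_berth", "hold"])
crane_agent = Agent("crane", {"in_use": False, "type": "quay_crane"}, ["load", "unload", "idle"])
yard_agent = Agent("yard", {"has_container": True}, ["stack", "retrieve"])

agents = [berth_agent, crane_agent, yard_agent]
action = crane_agent.decide(task="unload_vessel_A")
print(action in crane_agent.actions)  # True
```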
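Claims 4-7 describe a shared experience pool filled every K decision actions, parameter aggregation across agents, and probability-weighted sampling tied to a TD-error. A minimal sketch of that loop follows; the priority formula (absolute TD-error) and the plain parameter averaging are assumptions, since the patent does not publish the concrete update formula or value function.

```python
import random

class ExperiencePool:
    # Shared pool of (state, action, reward) experiences (claim 4).
    def __init__(self):
        self.experiences = []
        self.priorities = []  # assumed priority: |TD-error| per experience

    def upload(self, batch, td_errors):
        # Each agent uploads K experiences plus their TD-errors.
        self.experiences.extend(batch)
        self.priorities.extend(abs(e) for e in td_errors)

    def sample(self, n):
        # Sample experiences according to sampling probability (claim 6).
        total = sum(self.priorities)
        probs = [p / total for p in self.priorities]
        return random.choices(self.experiences, weights=probs,
                              k=min(n, len(self.experiences)))

def aggregate(param_sets):
    # Assumed aggregation rule (claim 5): average each parameter across
    # agents, then return the result for the per-agent update step.
    k = len(param_sets)
    return [sum(ps[i] for ps in param_sets) / k for i in range(len(param_sets[0]))]

pool = ExperiencePool()
K = 3
batch = [(("s%d" % i,), "a", 1.0) for i in range(K)]
pool.upload(batch, td_errors=[0.5, -1.2, 0.1])
sampled = pool.sample(2)
shared = aggregate([[1.0, 2.0], [3.0, 4.0]])
print(shared)  # [2.0, 3.0]
```

In a full implementation the averaged parameters would be written back into each agent's network before the next round of sampling and training, as claim 5 requires.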
CN202111284086.1A 2021-11-01 2021-11-01 Container loading and unloading intelligent method and system based on reinforcement learning Pending CN114186712A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111284086.1A CN114186712A (en) 2021-11-01 2021-11-01 Container loading and unloading intelligent method and system based on reinforcement learning


Publications (1)

Publication Number Publication Date
CN114186712A true CN114186712A (en) 2022-03-15

Family

ID=80540566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111284086.1A Pending CN114186712A (en) 2021-11-01 2021-11-01 Container loading and unloading intelligent method and system based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN114186712A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112465151A (en) * 2020-12-17 2021-03-09 电子科技大学长三角研究院(衢州) Multi-agent federal cooperation method based on deep reinforcement learning
CN112801290A (en) * 2021-02-26 2021-05-14 中国人民解放军陆军工程大学 Multi-agent deep reinforcement learning method, system and application
CN113128705A (en) * 2021-03-24 2021-07-16 北京科技大学顺德研究生院 Intelligent agent optimal strategy obtaining method and device


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Wang Gang: "Research on Integrated Optimization of the Logistics Operation System of Port Container Terminals Based on Multi-Agent", Logistics Technology (《物流技术》), vol. 26, no. 4, 30 April 2007 (2007-04-30), page 3 *
Gao Xuefeng: "Research on Dynamic Scheduling of Dual Yard Cranes at Automated Container Terminals Based on Deep Reinforcement Learning", China Masters' Theses Full-text Database, Engineering Science and Technology II (《中国优秀硕士学位论文全文数据库 工程科技Ⅱ辑》), no. 02, 15 February 2021 (2021-02-15), page 3 *

Similar Documents

Publication Publication Date Title
Xiang et al. Reactive strategy for discrete berth allocation and quay crane assignment problems under uncertainty
Hu et al. A three-stage decomposition method for the joint vehicle dispatching and storage allocation problem in automated container terminals
CA2992053C (en) Online hierarchical ensemble of learners for activity time prediction in open pit mining
Zheng et al. A two-stage stochastic programming for single yard crane scheduling with uncertain release times of retrieval tasks
CN106651049B (en) Rescheduling method for automatic container terminal loading and unloading equipment
CN108932366B (en) Simulation intelligent scheduling method and system for unloading production of coal wharf
Wang et al. Tree based searching approaches for integrated vehicle dispatching and container allocation in a transshipment hub
Caballini et al. An event-triggered receding-horizon scheme for planning rail operations in maritime terminals
Misir et al. A selection hyper-heuristic for scheduling deliveries of ready-mixed concrete
Hartmann Scheduling reefer mechanics at container terminals
WO2016118122A1 (en) Optimization of truck assignments in a mine using simulation
CN106651280A (en) Container ship logistics transportation scheduling method and system
Maione et al. A generalized stochastic Petri net approach for modeling activities of human operators in intermodal container terminals
Verma et al. A reinforcement learning framework for container selection and ship load sequencing in ports
Zhang et al. Vehicle dynamic dispatching using curriculum-driven reinforcement learning
CN114186712A (en) Container loading and unloading intelligent method and system based on reinforcement learning
Jin et al. Container port truck dispatching optimization using Real2Sim based deep reinforcement learning
US20230252395A1 (en) A quay crane operation method
Rida Modeling and optimization of decision-making process during loading and unloading operations at container port
JP4209109B2 (en) Container terminal operation optimization system
CN115293443B (en) Bridge crane and container ship loading and unloading operation time prediction method, system and medium
KR20100048004A (en) An integrated scheduling method for different types of equipment in automated container terminals
CN115545369A (en) Automated quayside container bridge resource planning decision-making method, terminal and medium
Wu et al. Integrated proactive-reactive approach and a hybrid adaptive large neighborhood search algorithm for berth and quay crane scheduling under uncertain combination.
Liu et al. Fuzzy optimization of storage space allocation in a container terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination