CN112857373B - Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions - Google Patents

Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions

Info

Publication number
CN112857373B
CN112857373B (application CN202110217057.7A)
Authority
CN
China
Prior art keywords
robot
action
network
optical flow
actions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110217057.7A
Other languages
Chinese (zh)
Other versions
CN112857373A (en)
Inventor
李治军
高铭浩
王勃然
金晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202110217057.7A priority Critical patent/CN112857373B/en
Publication of CN112857373A publication Critical patent/CN112857373A/en
Application granted granted Critical
Publication of CN112857373B publication Critical patent/CN112857373B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an energy-saving unmanned vehicle path navigation method that minimizes useless actions, and belongs to the field of autonomous navigation. The method comprises the following steps: step one, predicting the action the robot has just executed from the current optical flow image and the optical flow image from before that action; step two, setting a new reward function and predicting the action the robot should currently execute to avoid obstacles, based on the robot's action sequence within a past window and the current observation, thereby reducing unnecessary left-right swinging and steering actions during navigation. The invention avoids the complex and cumbersome pipeline of traditional SLAM methods and, using only visual information, reduces the redundant swinging actions produced in the robot's reinforcement learning decision process, thereby improving navigation efficiency and reducing wasted energy.

Description

Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions
Technical Field
The invention relates to an energy-saving unmanned vehicle path navigation method for minimizing useless actions, belonging to the field of autonomous navigation.
Background
Autonomous navigation is a long-standing field of robotics research and an essential capability for mobile robots performing tasks in everyday human environments. Conventional vision-based navigation methods typically construct an environment map and then use path planning to reach the target. They usually rely on precise, high-quality stereo cameras and other sensors (e.g., laser rangefinders) and are often computationally demanding.
Conventional navigation methods typically require solving many component problems, including map construction, motion planning, and low-level robot control. Their reliance on high-quality geometric maps, planned trajectories, and perfect localization makes these methods complex and cumbersome, with poor generalization, so they must be tailored to a specific environment. In recent years, reinforcement learning based methods have proven able to map raw sensor data directly, end to end, to control commands. This end-to-end approach reduces implementation complexity and makes effective use of input data from different sensors (e.g., depth cameras, laser sensors), thereby reducing cost, power consumption, and computation time. Another advantage is that the end-to-end relationship between input data and control output can use arbitrary nonlinear complex models, which work well for different control problems (e.g., lane tracking, autonomous driving, and drone control).
However, the control strategy obtained with reinforcement learning may generate useless, redundant actions during navigation; for example, it may repeatedly swing left and right in place to acquire more local map information, which makes the robot less efficient during navigation and consumes extra energy.
Existing autonomous navigation techniques usually rely on one of two approaches:
1、SLAM:
SLAM solves the problems of localization and map construction simultaneously. For an indoor navigation robot, knowing its specific position in the room and, after mapping, the structural information of the whole room allows a traditional path-planning algorithm to drive it to the destination over the shortest distance. The robot starts at some location in an unknown environment, observes the surrounding environment through its sensors, uses the camera information to estimate its position, pose, and motion trajectory, and builds a map from these poses, thereby achieving simultaneous localization and mapping. Concretely, environmental information is first acquired through the sensor; feature points are extracted from the image, and the camera pose is computed from the transformation of these feature points. Loop-closure detection is performed at the same time to judge whether the robot has returned to a previously visited place, and back-end optimization of the poses yields a globally optimal state. Finally, a map is built from the camera pose at each moment and the target information in the space.
2. Reinforcement learning navigation:
Reinforcement learning navigation works mainly by having the robot (Agent) interact with the environment (Env) to learn a policy network (Policy) that maps observations directly to robot control commands in an end-to-end manner, without constructing a global map. Reinforcement learning scores the actions the robot performs during interaction (for example, collisions penalize the agent, while progress toward the target is rewarded); the agent seeks more reward and thus learns, from this reward-and-penalty mechanism, a policy that maximizes the expected reward. In the navigation task, the Agent learns a policy network that achieves obstacle avoidance while driving toward a given target point.
They also suffer from respective disadvantages:
1. drawbacks of SLAM:
The map constructed by such methods is a sparse point cloud: only part of the feature points in the image are retained as key points, these key points are fixed in space for localization, and it is difficult to represent obstacles in such a map. During initialization, the camera must be aimed at objects with rich features and geometric texture while kept in slow motion; frames are easily lost during rotation, and the method is particularly sensitive to pure-rotation noise and has no scale invariance. Because the environment structure changes over time, the constructed map may no longer match the current state of the environment, so the map still needs to be rebuilt. In addition, SLAM consists of a large number of modules, is complex and cumbersome, and errors accumulate between modules, resulting in an inaccurate method that lacks generalization ability and requires high-accuracy sensors.
2. Disadvantages of reinforcement learning navigation:
Because it lacks a dynamics model and global information, reinforcement-learning navigation predicts the robot's actions directly end to end, which may cause the robot to generate unnecessary actions (for example, when passing through a doorway it often produces many side-to-side swinging actions to acquire more local map information), leading to low navigation efficiency and therefore extra energy consumption.
To introduce additional information during navigation, an optical flow image is added to the visual input. Optical flow introduces extra visual motion information: it represents the motion relationship between two consecutive images, namely the displacement of pixels along the x and y axes. Because it represents the motion state of objects in an image well, optical flow is widely used as a neural network input in action recognition. The typical network architecture for that task is a two-branch architecture that takes the current RGB image and the optical flow sequence between video frames to predict what action the person in the current image is performing. Most current optical flow action recognition works on third-person video and mainly recognizes the actions of people in the video. Adding optical flow to predict actions in robot navigation is first-person action recognition, and the optical flow images differ: in third-person action recognition the camera is essentially still and the object moves, whereas in first-person robot navigation the background is essentially still and the camera moves. Moreover, because of inertia the robot cannot change its motion state immediately while executing an action, which makes the optical flow image inaccurate and reduces the accuracy of action recognition.
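As a rough illustration of what a dense optical flow field contains, the sketch below computes flow between two synthetic frames with OpenCV's Farneback method and reads off the mean horizontal displacement as a crude left/right-turn cue. The patent itself predicts flow with LiteFlowNet; the Farneback call, the synthetic frames, and the turn heuristic here are stand-ins for illustration only.

```python
import cv2
import numpy as np

# Synthetic frame pair: the second frame is the first shifted 3 px to the right,
# mimicking the whole-background shift seen when only the camera turns.
rng = np.random.default_rng(0)
texture = rng.integers(0, 255, (120, 160), dtype=np.uint8)
prev_gray = cv2.GaussianBlur(texture, (7, 7), 0)
curr_gray = np.roll(prev_gray, shift=3, axis=1)

# flow has shape (H, W, 2): per-pixel displacement along the x and y axes.
flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

# Crude first-person cue: mean horizontal flow hints at a left or right turn,
# because the background moves as a whole when only the camera rotates.
mean_dx = float(np.mean(flow[..., 0]))
print("mean horizontal flow (px):", mean_dx)  # roughly +3 for this synthetic shift
```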
Disclosure of Invention
The invention aims to provide an energy-saving unmanned vehicle path navigation method capable of minimizing useless actions so as to solve the problems in the prior art.
An energy-saving unmanned vehicle path navigation method for minimizing useless actions, comprising the following steps:
step one, predicting the action the robot has just executed from the current optical flow image and the optical flow image from before that action;
step two, setting a new reward function and predicting the action the robot should currently execute to avoid obstacles, based on the robot's action sequence over the previous n steps (i.e., the actions predicted from the optical flow images of the previous time steps) and the current observation, thereby reducing redundant left-right swinging and steering actions during navigation, where n is less than or equal to 10.
Further, step one specifically includes the following steps:
step 1.1, the robot obtains the RGB observation image of the current frame and the RGB observation image of the previous frame;
step 1.2, predicting dense optical flow with LiteFlowNet from the RGB observation images of the current frame and the previous frame;
step 1.3, letting the robot act continuously in a simulation environment and storing each computed optical flow image together with the label of the action actually executed, to generate a training data set (see the data-collection sketch after this list);
step 1.4, constructing a neural network classifier, feeding the optical flow images in the data set into it to predict the action type label, and training the action classifier by supervised learning;
step 1.5, inputting the optical flow image obtained after the last executed action and the optical flow image of the previous frame into the classifier to obtain the current action prediction result.
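As a sketch of the data collection in step 1.3, the loop below pairs each optical flow image with the label of the action that produced it. The `env.reset`/`env.step` simulator interface, the `estimate_flow` helper standing in for LiteFlowNet, and the three-action set are assumptions for illustration, not details taken from the patent.

```python
import random
import numpy as np

ACTIONS = ["forward", "turn_left", "turn_right"]  # assumed discrete action set

def collect_flow_dataset(env, estimate_flow, num_samples=10_000):
    """Roll the robot through the simulator and store each optical flow image
    together with the label of the action that was actually executed."""
    flows, labels = [], []
    prev_rgb = env.reset()                        # hypothetical simulator API
    while len(flows) < num_samples:
        action_id = random.randrange(len(ACTIONS))
        curr_rgb = env.step(ACTIONS[action_id])   # execute the action, observe the next frame
        flow = estimate_flow(prev_rgb, curr_rgb)  # (H, W, 2) dense flow, e.g. from LiteFlowNet
        flows.append(flow)
        labels.append(action_id)
        prev_rgb = curr_rgb
    return np.stack(flows), np.array(labels)
```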
Further, the classifier consists of several convolution layers and a fully connected layer, and the action type label is the type of action corresponding to the optical flow image.
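A minimal PyTorch-style sketch of such a classifier follows. The input channel count (two stacked flow images, i.e. four channels), the layer sizes, and the three-class output are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class FlowActionClassifier(nn.Module):
    """A few convolution layers followed by a fully connected layer that maps a
    pair of optical flow images to action-type logits."""
    def __init__(self, num_actions: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, num_actions)

    def forward(self, flow_pair: torch.Tensor) -> torch.Tensor:
        # flow_pair: (B, 4, H, W) = previous and current optical flow images stacked
        x = self.conv(flow_pair).flatten(1)
        return self.fc(x)

# Supervised training step (step 1.4): cross-entropy on (flow, action label) pairs.
model = FlowActionClassifier(num_actions=3)
logits = model(torch.randn(8, 4, 120, 160))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 3, (8,)))
loss.backward()
```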
Further, step two specifically includes the following steps:
step 2.1, at the current time t the robot obtains an observation s_t; from the observed RGB-D image, the Policy network predicts the action a_t that should currently be executed; the robot executes a_t to interact with the environment, the environment feeds back a reward for a_t, and a new observation s_{t+1} is obtained for a new round of action prediction; this process repeats until the robot reaches the given target position, and the Policy network is trained by reinforcement learning so that the predicted actions obtain more reward and the navigation capability is learned (a minimal sketch of this interaction loop follows this list);
step 2.2, in the Policy network, adding an optical flow branch that predicts the robot's previous action and appending that action to the past action sequence used to compute the new reward function;
step 2.3, concatenating the RGB-D image feature vector extracted by the RGB-D branch's convolution layers, the embedding vector of the action predicted by the optical flow branch, and the target position information, and feeding the concatenated vector into the action decision network to predict the action that should currently be executed;
step 2.4, continuing to train the Policy network with the new reward-and-penalty function, which relates the currently predicted action to the actions of the recent past, so that the robot reduces redundant rotation and swinging actions.
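A minimal sketch of the step-2.1 interaction loop, assuming a hypothetical `env` with a gym-like interface and a `policy` that returns an action distribution; neither interface is specified by the patent.

```python
import torch

def run_episode(env, policy, max_steps=500):
    """Observe, predict an action with the Policy network, act, receive a reward,
    and repeat until the target position is reached or the step budget runs out."""
    obs = env.reset()                                # RGB-D observation s_t (plus target info)
    trajectory = []
    for _ in range(max_steps):
        with torch.no_grad():
            dist = policy(obs)                       # action distribution pi_theta(. | s_t)
            action = dist.sample()                   # a_t
        next_obs, reward, done = env.step(action)    # environment feedback r_t and s_{t+1}
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:                                     # target reached (or episode terminated)
            break
    return trajectory
```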
Further, the new reward function in step two is specified as follows: the currently predicted action receives an additional penalty according to the degree of robot swaying, i.e. the reward is computed as the sum of the following terms: a penalty for each time step; a penalty when the robot collides; an additionally introduced reward term that penalizes the swinging actions made by the robot, where swing_count (the degree of swinging) is obtained from the number of left-right turning actions in the past action sequence; a reward for completing navigation; and a final term, scaled by a reward coefficient, that rewards the robot when it gets closer to the target and penalizes it when it moves away, where D_t is the distance to the target after executing the action and D_{t-1} is the distance to the target before executing the action.
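The following sketch expresses this reward in code. Because the formula image is not reproduced in this text, every coefficient value below is a placeholder chosen only to show the structure of the terms.

```python
def reward(collided: bool, swing_count: int, reached_goal: bool,
           dist_prev: float, dist_curr: float,
           step_penalty: float = 0.01, collision_penalty: float = 0.1,
           swing_penalty: float = 0.02, success_reward: float = 10.0,
           dist_coeff: float = 1.0) -> float:
    """Sum of the terms described above; all coefficients are placeholders."""
    r = -step_penalty                          # penalty for every time step
    if collided:
        r -= collision_penalty                 # penalty when the robot collides
    r -= swing_penalty * swing_count           # penalty on swinging, counted from the
                                               # left-right turns in the past action window
    if reached_goal:
        r += success_reward                    # reward for completing navigation
    r += dist_coeff * (dist_prev - dist_curr)  # reward for approaching the target,
                                               # penalty for moving away from it
    return r
```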
During the interaction between robot and environment, the observations, the corresponding action labels, and the resulting reward sequence are continuously collected and stored in a buffer. After a certain number of steps has been collected, the sequences are taken out and, following the reinforcement learning policy gradient method, the gradient of the objective function J(π_θ) is computed and used to update the Policy network, so that the robot obtains the maximum expected return through the current Policy network.
Here the advantage function A evaluates how good it is to execute action a_t in state s_t, i.e. how much reward the action can obtain, and is implemented by a fully connected layer; the log term is the log-probability log π_θ(a_t | s_t) of predicting a_t in state s_t; π_θ is the policy network, θ its parameters, and E the expectation.
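A minimal surrogate for this policy gradient step is sketched below; the advantages are assumed to come from the fully connected A head and the log-probabilities from the Policy network, both over a sampled batch of steps.

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Surrogate whose gradient is -E[ A(s_t, a_t) * grad log pi_theta(a_t | s_t) ];
    minimizing it raises the probability of actions with positive advantage."""
    return -(advantages.detach() * log_probs).mean()
```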
To address the low sampling efficiency of the policy gradient, the PPO reinforcement learning method is used during training: observations, the corresponding action sequences, and the reward sequences are sampled with a policy network whose parameters are kept fixed, and the gradient is then computed after adjusting for the difference between the distribution of the current policy network and that of the policy network used for sampling. This makes it possible to estimate the expected return of the current network from samples drawn by the sampling network and improves data utilization; concretely, the objective contains the probability ratio p_θ'(a_t | s_t) / p_θ(a_t | s_t), where θ' denotes the parameters of the current network and p_θ(a_t | s_t) is the probability of selecting a_t in state s_t under the sampling policy network π_θ.
After several iterations of updating, the parameters of the sampling network are refreshed with the current policy network parameters. A penalty factor is also added to the objective function as its last term: when the two sets of network parameters differ greatly, a larger penalty is applied, which controls the update speed of the network. Here KL(θ, θ_k) is the function that measures the difference between the two sets of network parameters, and β is the parameter controlling the update rate of the network.
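A sketch of the KL-penalized PPO surrogate described above. The β value is an assumed starting point (PPO typically adapts it), and the log-probabilities, advantages, and per-step KL divergences are assumed to be computed elsewhere from the sampling network and the current network.

```python
import torch

def ppo_kl_objective(logp_new: torch.Tensor, logp_old: torch.Tensor,
                     advantages: torch.Tensor, kl: torch.Tensor,
                     beta: float = 0.01) -> torch.Tensor:
    """Importance-sampled surrogate with a KL penalty that slows the update when the
    current policy drifts far from the fixed sampling policy."""
    ratio = torch.exp(logp_new - logp_old.detach())   # p_theta'(a_t|s_t) / p_theta(a_t|s_t)
    surrogate = (ratio * advantages.detach()).mean()  # expected return estimated from old samples
    return -(surrogate - beta * kl.mean())            # negated so it can be minimized
```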
When training the Policy network by reinforcement learning, the navigation capability is first trained on its own without the optical flow branch, i.e. using the reward function without its third term; the optical flow action classifier is then introduced, the parameters of the RGB-D branch and the optical flow branch are frozen, and only the parameters of the subsequent action decision network are updated.
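A sketch of this two-stage schedule in code; the attribute names `rgbd_branch`, `flow_branch`, and `decision_head` are assumptions for illustration, not names used in the patent.

```python
def freeze_for_stage_two(policy_net):
    """After navigation has been learned without the swing penalty, freeze the RGB-D
    branch and the optical flow branch (including the flow action classifier) and
    return only the action-decision parameters for the optimizer."""
    for module in (policy_net.rgbd_branch, policy_net.flow_branch):
        for p in module.parameters():
            p.requires_grad = False
    return [p for p in policy_net.decision_head.parameters() if p.requires_grad]
```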
The invention has the following advantages: by setting a new reward in reinforcement learning, introducing optical flow information to judge the robot's action type, combining it with the other observations, and feeding everything into the action decision network, the method predicts the navigation action the robot should currently execute using visual information alone. This avoids the complex and cumbersome pipeline of traditional SLAM methods, reduces the unnecessary swinging actions produced in the robot's decision process, improves navigation efficiency, and at the same time reduces wasted energy.
Drawings
FIG. 1 is an optical flow image generated when a robot performs an action;
fig. 2 is a diagram of a Policy network architecture.
Detailed Description
The following is a clear and complete description of the embodiments of the present invention with reference to the accompanying drawings; obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of the invention.
Referring to fig. 2, the image feature extraction part of the network consists of two branches. The optical flow branch first predicts the corresponding optical flow image from the RGB images using LiteFlowNet, then obtains the action type through the classifier network and converts it into an embedding vector. The RGB-D branch extracts a feature vector through a convolutional network and a fully connected layer. The two vectors are concatenated with the relative target position to form a new vector that serves as the input of the subsequent action policy network, from which the action to execute is finally sampled.
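The sketch below mirrors that two-branch structure; all layer sizes, the two-dimensional relative-target encoding, and the three-action set are assumptions for illustration. The optical flow branch (LiteFlowNet plus the action classifier) is assumed to run outside this module and to hand in the id of the predicted previous action.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """RGB-D convolutional branch + embedding of the action predicted by the optical
    flow branch, concatenated with the relative target and fed to the decision head."""
    def __init__(self, num_actions: int = 3, embed_dim: int = 16):
        super().__init__()
        self.rgbd_branch = nn.Sequential(            # RGB-D (4-channel) feature extractor
            nn.Conv2d(4, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.action_embed = nn.Embedding(num_actions, embed_dim)   # flow-branch action -> vector
        self.decision_head = nn.Sequential(
            nn.Linear(128 + embed_dim + 2, 128), nn.ReLU(),        # +2 for relative target (distance, angle)
            nn.Linear(128, num_actions),
        )

    def forward(self, rgbd, prev_action_id, rel_target):
        feat = self.rgbd_branch(rgbd)
        embed = self.action_embed(prev_action_id)                  # action predicted from optical flow
        logits = self.decision_head(torch.cat([feat, embed, rel_target], dim=1))
        return torch.distributions.Categorical(logits=logits)      # distribution to sample the action from

# Example forward pass on a batch of two observations.
net = PolicyNet()
dist = net(torch.randn(2, 4, 120, 160), torch.tensor([0, 2]), torch.randn(2, 2))
action = dist.sample()
```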
An energy-saving unmanned vehicle path navigation method for minimizing useless actions, comprising the following steps:
step one, predicting the action the robot has just executed from the current optical flow image and the optical flow image from before that action;
step two, setting a new reward function and predicting the action the robot should currently execute to avoid obstacles, based on the robot's action sequence over the previous n steps (i.e., the actions predicted from the optical flow images of the previous time steps) and the current observation, thereby reducing redundant left-right swinging and steering actions during navigation, where n is less than or equal to 10.
Further, step one specifically includes the following steps:
step 1.1, the robot obtains the RGB observation image of the current frame and the RGB observation image of the previous frame;
step 1.2, predicting dense optical flow with LiteFlowNet from the RGB observation images of the current frame and the previous frame;
step 1.3, letting the robot act continuously in a simulation environment and storing each computed optical flow image together with the label of the action actually executed, to generate a training data set;
step 1.4, constructing a neural network classifier, feeding the optical flow images in the data set into it to predict the action type label, and training the action classifier by supervised learning;
step 1.5, inputting the optical flow image obtained after the last executed action and the optical flow image of the previous frame into the classifier to obtain the current action prediction result.
Specifically, referring to fig. 1, the rightmost image is the color legend for the vectors produced when pixels move in different directions. The red image on the left corresponds to the robot turning left, with the pixels moving to the right, and blue corresponds to a right turn. The two left images appear almost uniform in color because the background is stationary and only the camera moves, so the background pixels shift as a whole. The third image shows the effect of moving forward: the pixels move divergently outward, producing a multi-colored image according to the pixel-motion legend. When optical flow is used to predict actions, the robot cannot switch motion states immediately because of inertia; for example, if the robot receives a left-turn command while performing a right turn, it cannot instantly enter the left-turn state, so the resulting optical flow image still shows a right turn. Although the two action states are nominally the same, the degree of rotation differs somewhat and the two optical flow images differ, so the previous optical flow image is also fed in for prediction to reduce the influence of inertia as much as possible.
Further, the classifier consists of several convolution layers and a fully connected layer, and the action type label is the type of action corresponding to the optical flow image.
Further, step two specifically includes the following steps:
step 2.1, at the current time t the robot obtains an observation s_t; from the observed RGB-D image, the Policy network predicts the action a_t that should currently be executed; the robot executes a_t to interact with the environment, the environment feeds back a reward for a_t, and a new observation s_{t+1} is obtained for a new round of action prediction; this process repeats until the robot reaches the given target position, and the Policy network is trained by reinforcement learning so that the predicted actions obtain more reward and the navigation capability is learned;
step 2.2, in the Policy network, adding an optical flow branch that predicts the robot's previous action and appending that action to the past action sequence used to compute the new reward function;
step 2.3, concatenating the RGB-D image feature vector extracted by the RGB-D branch's convolution layers, the embedding vector of the action predicted by the optical flow branch, and the target position information, and feeding the concatenated vector into the action decision network to predict the action that should currently be executed;
step 2.4, continuing to train the Policy network with the new reward-and-penalty function, which relates the currently predicted action to the actions of the recent past, so that the robot reduces redundant rotation and swinging actions.
Further, the new reward function in step two is specified as follows: the currently predicted action receives an additional penalty according to the degree of robot swaying, i.e. the reward is computed as the sum of the following terms: a penalty for each time step; a penalty when the robot collides; an additionally introduced reward term that penalizes the swinging actions made by the robot, where swing_count (the degree of swinging) is obtained from the number of left-right turning actions in the past action sequence; a reward for completing navigation; and a final term, scaled by a reward coefficient, that rewards the robot when it gets closer to the target and penalizes it when it moves away, where D_t is the distance to the target after executing the action and D_{t-1} is the distance to the target before executing the action.
During the interaction between robot and environment, the observations, the corresponding action labels, and the resulting reward sequence are continuously collected and stored in a buffer. After a certain number of steps has been collected, the sequences are taken out and, following the reinforcement learning policy gradient method, the gradient of the objective function J(π_θ) is computed and used to update the Policy network, so that the robot obtains the maximum expected return through the current Policy network. Here the advantage function A evaluates how good it is to execute action a_t in state s_t, i.e. how much reward the action can obtain, and is implemented by a fully connected layer; the log term is the log-probability log π_θ(a_t | s_t) of predicting a_t in state s_t; π_θ is the policy network, θ its parameters, and E the expectation.
To address the low sampling efficiency of the policy gradient, the PPO reinforcement learning method is used during training: observations, the corresponding action sequences, and the reward sequences are sampled with a policy network whose parameters are kept fixed, and the gradient is then computed after adjusting for the difference between the distribution of the current policy network and that of the policy network used for sampling. This makes it possible to estimate the expected return of the current network from samples drawn by the sampling network and improves data utilization; concretely, the objective contains the probability ratio p_θ'(a_t | s_t) / p_θ(a_t | s_t), where θ' denotes the parameters of the current network and p_θ(a_t | s_t) is the probability of selecting a_t in state s_t under the sampling policy network π_θ.
After several iterations of updating, the parameters of the sampling network are refreshed with the current policy network parameters. A penalty factor is also added to the objective function as its last term: when the two sets of network parameters differ greatly, a larger penalty is applied, which controls the update speed of the network. Here KL(θ, θ_k) is the function that measures the difference between the two sets of network parameters, and β is the parameter controlling the update rate of the network.
When training the Policy network by reinforcement learning, the navigation capability is first trained on its own without the optical flow branch, i.e. using the reward function without its third term; the optical flow action classifier is then introduced, the parameters of the RGB-D branch and the optical flow branch are frozen, and only the parameters of the subsequent action decision network are updated.

Claims (3)

1. An energy-saving unmanned vehicle path navigation method for minimizing useless actions is characterized by comprising the following steps:
step one, predicting the action the robot has just executed from the current optical flow image and the optical flow image from before that action;
step two, setting a new reward function and predicting the action the robot should currently execute to avoid obstacles, based on the robot's action sequence over the previous n steps, i.e. the actions predicted from the optical flow images of the previous time steps, and the current observation, thereby reducing redundant left-right swinging and steering actions during navigation, wherein n is less than or equal to 10,
step one specifically comprises the following steps:
step 1.1, the robot obtains the RGB observation image of the current frame and the RGB observation image of the previous frame;
step 1.2, predicting dense optical flow with LiteFlowNet from the RGB observation images of the current frame and the previous frame;
step 1.3, letting the robot act continuously in a simulation environment and storing each computed optical flow image together with the label of the action actually executed, to generate a training data set;
step 1.4, constructing a neural network classifier, feeding the optical flow images in the data set into it to predict the action type label, and training the action classifier by supervised learning;
step 1.5, inputting the optical flow image obtained after the last executed action and the optical flow image of the previous frame into the classifier to obtain the current action prediction result;
step two specifically comprises the following steps:
step 2.1, at the current time t the robot obtains an observation s_t; from the observed RGB-D image, the Policy network predicts the action a_t that should currently be executed; the robot executes a_t to interact with the environment, the environment feeds back a reward for a_t, and a new observation s_{t+1} is obtained for a new round of action prediction; this process repeats until the robot reaches the given target position, and the Policy network is trained by reinforcement learning so that the predicted actions obtain more reward and the navigation capability is learned;
step 2.2, in the Policy network, adding an optical flow branch that predicts the robot's previous action and appending that action to the past action sequence used to compute the new reward function;
step 2.3, concatenating the RGB-D image feature vector extracted by the RGB-D branch's convolution layers, the embedding vector of the action predicted by the optical flow branch, and the target position information, and feeding the concatenated vector into the action decision network to predict the action that should currently be executed;
step 2.4, continuing to train the Policy network with the new reward-and-penalty function, which relates the currently predicted action to the actions of the recent past, so that the robot reduces redundant rotation and swinging actions.
2. The energy-saving unmanned vehicle path navigation method for minimizing useless actions according to claim 1, wherein the classifier is composed of a plurality of convolution layers and a fully connected layer, and the action type label is the type of action corresponding to the optical flow image.
3. The energy-saving unmanned vehicle path navigation method for minimizing useless actions according to claim 1, wherein the new reward function in step two is specified as follows: the currently predicted action receives an additional penalty according to the degree of robot swaying, i.e. the reward is computed as the sum of the following terms: a penalty for each time step; a penalty when the robot collides; an additionally introduced reward term that penalizes the swinging actions made by the robot, where swing_count, the degree of swinging, is obtained from the number of left-right turning actions in the past action sequence; a reward for completing navigation; and a final term, scaled by a reward coefficient, that rewards the robot when it gets closer to the target and penalizes it when it moves away, where D_t is the distance to the target after executing the action and D_{t-1} is the distance to the target before executing the action,
during the interaction between robot and environment, the observations, the corresponding action labels, and the resulting reward sequence are continuously collected and stored in a buffer; after a certain number of steps has been collected, the sequences are taken out and, following the reinforcement learning policy gradient method, the gradient of the objective function J(π_θ) is computed and used to update the Policy network, so that the robot obtains the maximum expected return through the current Policy network, wherein the advantage function A evaluates how good it is to execute action a_t in state s_t, i.e. how much reward the action can obtain, and is implemented by a fully connected layer, the log term is the log-probability log π_θ(a_t | s_t) of predicting a_t in state s_t, π_θ is the policy network, θ its parameters, and E the expectation,
to address the low sampling efficiency of the policy gradient, the PPO reinforcement learning method is used during training: observations, the corresponding action sequences, and the reward sequences are sampled with a policy network whose parameters are kept fixed, and the gradient is then computed after adjusting for the difference between the distribution of the current policy network and that of the policy network used for sampling, so that the expected return of the current network is estimated from samples drawn by the sampling network and data utilization is improved, the objective containing the probability ratio p_θ'(a_t | s_t) / p_θ(a_t | s_t), where θ' denotes the parameters of the current network and p_θ(a_t | s_t) is the probability of selecting a_t in state s_t under the sampling policy network π_θ,
after several iterations of updating, the parameters of the sampling network are refreshed with the current policy network parameters, and a penalty factor is added to the objective function as its last term: when the two sets of network parameters differ greatly, a larger penalty is applied, controlling the update speed of the network, wherein KL(θ, θ_k) is the function measuring the difference between the two sets of network parameters and β is the parameter controlling the update rate of the network,
when training the Policy network by reinforcement learning, the navigation capability is first trained on its own without the optical flow branch, i.e. using the reward function without its third term; the optical flow action classifier is then introduced, the parameters of the RGB-D branch and the optical flow branch are frozen, and only the parameters of the subsequent action decision network are updated.
CN202110217057.7A 2021-02-26 2021-02-26 Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions Active CN112857373B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110217057.7A CN112857373B (en) 2021-02-26 2021-02-26 Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110217057.7A CN112857373B (en) 2021-02-26 2021-02-26 Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions

Publications (2)

Publication Number Publication Date
CN112857373A CN112857373A (en) 2021-05-28
CN112857373B true CN112857373B (en) 2024-02-20

Family

ID=75990127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110217057.7A Active CN112857373B (en) 2021-02-26 2021-02-26 Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions

Country Status (1)

Country Link
CN (1) CN112857373B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113776537B (en) * 2021-09-07 2024-01-19 山东大学 De-centralized multi-agent navigation method and system in unmarked complex scene

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852273A (en) * 2019-11-12 2020-02-28 重庆大学 Behavior identification method based on reinforcement learning attention mechanism
CN111141300A (en) * 2019-12-18 2020-05-12 南京理工大学 Intelligent mobile platform map-free autonomous navigation method based on deep reinforcement learning
CN111780777A (en) * 2020-07-13 2020-10-16 江苏中科智能制造研究院有限公司 Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852273A (en) * 2019-11-12 2020-02-28 重庆大学 Behavior identification method based on reinforcement learning attention mechanism
CN111141300A (en) * 2019-12-18 2020-05-12 南京理工大学 Intelligent mobile platform map-free autonomous navigation method based on deep reinforcement learning
CN111780777A (en) * 2020-07-13 2020-10-16 江苏中科智能制造研究院有限公司 Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Design of a navigation system for an unmanned transport vehicle based on visual SLAM; 吕海泳; 蔡建宁; 袁贺男; Shandong Industrial Technology (No. 03); full text *
Research on pose estimation and obstacle avoidance methods for indoor mobile robots. China Master's Theses Full-text Database. 2022, full text. *

Also Published As

Publication number Publication date
CN112857373A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
Ruan et al. Mobile robot navigation based on deep reinforcement learning
Gandhi et al. Learning to fly by crashing
US11561544B2 (en) Indoor monocular navigation method based on cross-sensor transfer learning and system thereof
CN110874578B (en) Unmanned aerial vehicle visual angle vehicle recognition tracking method based on reinforcement learning
CN110632931A (en) Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN116263335A (en) Indoor navigation method based on vision and radar information fusion and reinforcement learning
CN115860107B (en) Multi-machine searching method and system based on multi-agent deep reinforcement learning
Saksena et al. Towards behavioural cloning for autonomous driving
CN111580526B (en) Cooperative driving method for fixed vehicle formation scene
CN112857370A (en) Robot map-free navigation method based on time sequence information modeling
CN112857373B (en) Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions
CN115877869A (en) Unmanned aerial vehicle path planning method and system
Guo et al. A deep reinforcement learning based approach for AGVs path planning
CN115265547A (en) Robot active navigation method based on reinforcement learning in unknown environment
Liu et al. Multi-agent trajectory prediction with graph attention isomorphism neural network
Ma et al. Using RGB image as visual input for mapless robot navigation
Li A hierarchical autonomous driving framework combining reinforcement learning and imitation learning
CN117475279A (en) Reinforced learning navigation method based on target drive
Ejaz et al. Autonomous visual navigation using deep reinforcement learning: An overview
CN117553798A (en) Safe navigation method, equipment and medium for mobile robot in complex crowd scene
CN116734850A (en) Unmanned platform reinforcement learning autonomous navigation system and method based on visual input
CN116679710A (en) Robot obstacle avoidance strategy training and deployment method based on multitask learning
US20240054008A1 (en) Apparatus and method for performing a task
Zhang et al. Visual navigation of mobile robots in complex environments based on distributed deep reinforcement learning
Li et al. End-to-end autonomous exploration for mobile robots in unknown environments through deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant