CN112857373B - Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions - Google Patents

Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions

Info

Publication number
CN112857373B
CN112857373B (application CN202110217057.7A)
Authority
CN
China
Prior art keywords
robot
action
network
optical flow
actions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110217057.7A
Other languages
Chinese (zh)
Other versions
CN112857373A (en)
Inventor
李治军
高铭浩
王勃然
金晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202110217057.7A priority Critical patent/CN112857373B/en
Publication of CN112857373A publication Critical patent/CN112857373A/en
Application granted granted Critical
Publication of CN112857373B publication Critical patent/CN112857373B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an energy-saving unmanned vehicle path navigation method that minimizes useless actions, and belongs to the field of autonomous navigation. The method comprises the following steps: step one, predicting the action the robot has just executed from the current optical flow image and the optical flow image from before that action; step two, setting a new reward function and predicting the action the robot should currently execute to avoid obstacles, based on the robot's action sequence within a past window and the current observation, thereby reducing unnecessary left-right swinging and steering actions during navigation. The invention avoids the complex and cumbersome pipeline of traditional SLAM methods and, using only visual information, reduces the redundant swinging actions produced in the robot's reinforcement learning decision process, thereby improving navigation efficiency and reducing wasted energy.

Description

Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions
Technical Field
The invention relates to an energy-saving unmanned vehicle path navigation method for minimizing useless actions, belonging to the field of autonomous navigation.
Background
Autonomous navigation is a long-standing field of robotics research and an essential capability for mobile robots performing tasks in everyday human environments. Conventional vision-based navigation methods typically construct an environment map and then use path planning to reach the target. They usually rely on precise, high-quality stereo cameras and other sensors (e.g., laser rangefinders) and are often computationally demanding.
Conventional navigation methods typically require solving many component problems, including map construction, motion planning, and low-level robot control. Their reliance on high-quality geometric maps, planned trajectories, and perfect localization makes these methods complex and cumbersome, with poor generalization, so they must be tailored to a specific environment. In recent years, reinforcement learning based methods have proven able to map raw sensor data directly, end to end, to control commands. This end-to-end approach reduces implementation complexity and makes effective use of input data from different sensors (e.g., depth cameras, laser sensors), thereby reducing cost, power consumption, and computation time. Another advantage is that the end-to-end relationship between input data and control output can use arbitrary nonlinear complex models, which work well for different control problems (e.g., lane tracking, autonomous driving, and drone control).
However, the control strategy obtained with reinforcement learning may generate useless, redundant actions during navigation; for example, it may repeatedly swing left and right in place to acquire more local map information, which makes the robot less efficient during navigation and consumes extra energy.
Existing autonomous navigation techniques usually rely on one of two approaches:
1、SLAM:
SLAM solves the problems of localization and map construction simultaneously. For an indoor navigation robot, knowing its specific position in the room and, after mapping, the structural information of the whole room allows a traditional path-planning algorithm to drive it to the destination over the shortest distance. The robot starts at some location in an unknown environment, observes the surrounding environment through its sensors, uses the camera information to estimate its position, pose, and motion trajectory, and builds a map from these poses, thereby achieving simultaneous localization and mapping. Concretely, environmental information is first acquired through the sensor; feature points are extracted from the image, and the camera pose is computed from the transformation of these feature points. Loop-closure detection is performed at the same time to judge whether the robot has returned to a previously visited place, and back-end optimization of the poses yields a globally optimal state. Finally, a map is built from the camera pose at each moment and the target information in the space.
2. Reinforcement learning navigation:
Reinforcement learning navigation works mainly by having the robot (Agent) interact with the environment (Env) to learn a policy network (Policy) that maps observations directly to robot control commands in an end-to-end manner, without constructing a global map. Reinforcement learning scores the actions the robot performs during interaction (for example, collisions penalize the agent, while progress toward the target is rewarded); the agent seeks more reward and thus learns, from this reward-and-penalty mechanism, a policy that maximizes the expected reward. In the navigation task, the Agent learns a policy network that achieves obstacle avoidance while driving toward a given target point.
They also suffer from respective disadvantages:
1. drawbacks of SLAM:
The map constructed by such methods is a sparse point cloud: only part of the feature points in the image are retained as key points, these key points are fixed in space for localization, and it is difficult to represent obstacles in such a map. During initialization, the camera must be aimed at objects with rich features and geometric texture while kept in slow motion; frames are easily lost during rotation, and the method is particularly sensitive to pure-rotation noise and has no scale invariance. Because the environment structure changes over time, the constructed map may no longer match the current state of the environment, so the map still needs to be rebuilt. In addition, SLAM consists of a large number of modules, is complex and cumbersome, and errors accumulate between modules, resulting in an inaccurate method that lacks generalization ability and requires high-accuracy sensors.
2. Disadvantages of reinforcement learning navigation:
Because it lacks a dynamics model and global information, reinforcement-learning navigation predicts the robot's actions directly end to end, which may cause the robot to generate unnecessary actions (for example, when passing through a doorway it often produces many side-to-side swinging actions to acquire more local map information), leading to low navigation efficiency and therefore extra energy consumption.
To introduce additional information during navigation, an optical flow image is added to the visual input. Optical flow introduces extra visual motion information: it represents the motion relationship between two consecutive images, namely the displacement of pixels along the x and y axes. Because it represents the motion state of objects in an image well, optical flow is widely used as a neural network input in action recognition. The typical network architecture for that task is a two-branch architecture that takes the current RGB image and the optical flow sequence between video frames to predict what action the person in the current image is performing. Most current optical flow action recognition works on third-person video and mainly recognizes the actions of people in the video. Adding optical flow to predict actions in robot navigation is first-person action recognition, and the optical flow images differ: in third-person action recognition the camera is essentially still and the object moves, whereas in first-person robot navigation the background is essentially still and the camera moves. Moreover, because of inertia the robot cannot change its motion state immediately while executing an action, which makes the optical flow image inaccurate and reduces the accuracy of action recognition.
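As a rough illustration of what a dense optical flow field contains, the sketch below computes flow between two synthetic frames with OpenCV's Farneback method and reads off the mean horizontal displacement as a crude left/right-turn cue. The patent itself predicts flow with LiteFlowNet; the Farneback call, the synthetic frames, and the turn heuristic here are stand-ins for illustration only.

```python
import cv2
import numpy as np

# Synthetic frame pair: the second frame is the first shifted 3 px to the right,
# mimicking the whole-background shift seen when only the camera turns.
rng = np.random.default_rng(0)
texture = rng.integers(0, 255, (120, 160), dtype=np.uint8)
prev_gray = cv2.GaussianBlur(texture, (7, 7), 0)
curr_gray = np.roll(prev_gray, shift=3, axis=1)

# flow has shape (H, W, 2): per-pixel displacement along the x and y axes.
flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

# Crude first-person cue: mean horizontal flow hints at a left or right turn,
# because the background moves as a whole when only the camera rotates.
mean_dx = float(np.mean(flow[..., 0]))
print("mean horizontal flow (px):", mean_dx)  # roughly +3 for this synthetic shift
```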
Disclosure of Invention
The invention aims to provide an energy-saving unmanned vehicle path navigation method capable of minimizing useless actions so as to solve the problems in the prior art.
An energy-saving unmanned vehicle path navigation method for minimizing useless actions, comprising the following steps:
step one, predicting the action the robot has just executed from the current optical flow image and the optical flow image from before that action;
step two, setting a new reward function and predicting the action the robot should currently execute to avoid obstacles, based on the robot's action sequence over the previous n steps (i.e., the actions predicted from the optical flow images of the previous time steps) and the current observation, thereby reducing redundant left-right swinging and steering actions during navigation, where n is less than or equal to 10.
Further, step one specifically includes the following steps:
step 1.1, the robot obtains the RGB observation image of the current frame and the RGB observation image of the previous frame;
step 1.2, predicting dense optical flow with LiteFlowNet from the RGB observation images of the current frame and the previous frame;
step 1.3, letting the robot act continuously in a simulation environment and storing each computed optical flow image together with the label of the action actually executed, to generate a training data set (see the data-collection sketch after this list);
step 1.4, constructing a neural network classifier, feeding the optical flow images in the data set into it to predict the action type label, and training the action classifier by supervised learning;
step 1.5, inputting the optical flow image obtained after the last executed action and the optical flow image of the previous frame into the classifier to obtain the current action prediction result.
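As a sketch of the data collection in step 1.3, the loop below pairs each optical flow image with the label of the action that produced it. The `env.reset`/`env.step` simulator interface, the `estimate_flow` helper standing in for LiteFlowNet, and the three-action set are assumptions for illustration, not details taken from the patent.

```python
import random
import numpy as np

ACTIONS = ["forward", "turn_left", "turn_right"]  # assumed discrete action set

def collect_flow_dataset(env, estimate_flow, num_samples=10_000):
    """Roll the robot through the simulator and store each optical flow image
    together with the label of the action that was actually executed."""
    flows, labels = [], []
    prev_rgb = env.reset()                        # hypothetical simulator API
    while len(flows) < num_samples:
        action_id = random.randrange(len(ACTIONS))
        curr_rgb = env.step(ACTIONS[action_id])   # execute the action, observe the next frame
        flow = estimate_flow(prev_rgb, curr_rgb)  # (H, W, 2) dense flow, e.g. from LiteFlowNet
        flows.append(flow)
        labels.append(action_id)
        prev_rgb = curr_rgb
    return np.stack(flows), np.array(labels)
```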
Further, the classifier consists of several convolution layers and a fully connected layer, and the action type label is the type of action corresponding to the optical flow image.
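A minimal PyTorch-style sketch of such a classifier follows. The input channel count (two stacked flow images, i.e. four channels), the layer sizes, and the three-class output are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class FlowActionClassifier(nn.Module):
    """A few convolution layers followed by a fully connected layer that maps a
    pair of optical flow images to action-type logits."""
    def __init__(self, num_actions: int = 3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, num_actions)

    def forward(self, flow_pair: torch.Tensor) -> torch.Tensor:
        # flow_pair: (B, 4, H, W) = previous and current optical flow images stacked
        x = self.conv(flow_pair).flatten(1)
        return self.fc(x)

# Supervised training step (step 1.4): cross-entropy on (flow, action label) pairs.
model = FlowActionClassifier(num_actions=3)
logits = model(torch.randn(8, 4, 120, 160))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 3, (8,)))
loss.backward()
```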
Further, step two specifically includes the following steps:
step 2.1, at the current time t the robot obtains an observation s_t; from the observed RGB-D image, the Policy network predicts the action a_t that should currently be executed; the robot executes a_t to interact with the environment, the environment feeds back a reward for a_t, and a new observation s_{t+1} is obtained for a new round of action prediction; this process repeats until the robot reaches the given target position, and the Policy network is trained by reinforcement learning so that the predicted actions obtain more reward and the navigation capability is learned (a minimal sketch of this interaction loop follows this list);
step 2.2, in the Policy network, adding an optical flow branch that predicts the robot's previous action and appending that action to the past action sequence used to compute the new reward function;
step 2.3, concatenating the RGB-D image feature vector extracted by the RGB-D branch's convolution layers, the embedding vector of the action predicted by the optical flow branch, and the target position information, and feeding the concatenated vector into the action decision network to predict the action that should currently be executed;
step 2.4, continuing to train the Policy network with the new reward-and-penalty function, which relates the currently predicted action to the actions of the recent past, so that the robot reduces redundant rotation and swinging actions.
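A minimal sketch of the step-2.1 interaction loop, assuming a hypothetical `env` with a gym-like interface and a `policy` that returns an action distribution; neither interface is specified by the patent.

```python
import torch

def run_episode(env, policy, max_steps=500):
    """Observe, predict an action with the Policy network, act, receive a reward,
    and repeat until the target position is reached or the step budget runs out."""
    obs = env.reset()                                # RGB-D observation s_t (plus target info)
    trajectory = []
    for _ in range(max_steps):
        with torch.no_grad():
            dist = policy(obs)                       # action distribution pi_theta(. | s_t)
            action = dist.sample()                   # a_t
        next_obs, reward, done = env.step(action)    # environment feedback r_t and s_{t+1}
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:                                     # target reached (or episode terminated)
            break
    return trajectory
```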
Further, the new reward function in step two is specified as follows: the currently predicted action receives an additional penalty according to the degree of robot swaying, i.e. the reward is computed as the sum of the following terms: a penalty for each time step; a penalty when the robot collides; an additionally introduced reward term that penalizes the swinging actions made by the robot, where swing_count (the degree of swinging) is obtained from the number of left-right turning actions in the past action sequence; a reward for completing navigation; and a final term, scaled by a reward coefficient, that rewards the robot when it gets closer to the target and penalizes it when it moves away, where D_t is the distance to the target after executing the action and D_{t-1} is the distance to the target before executing the action.
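The following sketch expresses this reward in code. Because the formula image is not reproduced in this text, every coefficient value below is a placeholder chosen only to show the structure of the terms.

```python
def reward(collided: bool, swing_count: int, reached_goal: bool,
           dist_prev: float, dist_curr: float,
           step_penalty: float = 0.01, collision_penalty: float = 0.1,
           swing_penalty: float = 0.02, success_reward: float = 10.0,
           dist_coeff: float = 1.0) -> float:
    """Sum of the terms described above; all coefficients are placeholders."""
    r = -step_penalty                          # penalty for every time step
    if collided:
        r -= collision_penalty                 # penalty when the robot collides
    r -= swing_penalty * swing_count           # penalty on swinging, counted from the
                                               # left-right turns in the past action window
    if reached_goal:
        r += success_reward                    # reward for completing navigation
    r += dist_coeff * (dist_prev - dist_curr)  # reward for approaching the target,
                                               # penalty for moving away from it
    return r
```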
During the interaction between robot and environment, the observations, the corresponding action labels, and the resulting reward sequence are continuously collected and stored in a buffer. After a certain number of steps has been collected, the sequences are taken out and, following the reinforcement learning policy gradient method, the gradient of the objective function J(π_θ) is computed and used to update the Policy network, so that the robot obtains the maximum expected return through the current Policy network.
Here the advantage function A evaluates how good it is to execute action a_t in state s_t, i.e. how much reward the action can obtain, and is implemented by a fully connected layer; the log term is the log-probability log π_θ(a_t | s_t) of predicting a_t in state s_t; π_θ is the policy network, θ its parameters, and E the expectation.
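A minimal surrogate for this policy gradient step is sketched below; the advantages are assumed to come from the fully connected A head and the log-probabilities from the Policy network, both over a sampled batch of steps.

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Surrogate whose gradient is -E[ A(s_t, a_t) * grad log pi_theta(a_t | s_t) ];
    minimizing it raises the probability of actions with positive advantage."""
    return -(advantages.detach() * log_probs).mean()
```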
To address the low sampling efficiency of the policy gradient, the PPO reinforcement learning method is used during training: observations, the corresponding action sequences, and the reward sequences are sampled with a policy network whose parameters are kept fixed, and the gradient is then computed after adjusting for the difference between the distribution of the current policy network and that of the policy network used for sampling. This makes it possible to estimate the expected return of the current network from samples drawn by the sampling network and improves data utilization; concretely, the objective contains the probability ratio p_θ'(a_t | s_t) / p_θ(a_t | s_t), where θ' denotes the parameters of the current network and p_θ(a_t | s_t) is the probability of selecting a_t in state s_t under the sampling policy network π_θ.
After several iterations of updating, the parameters of the sampling network are refreshed with the current policy network parameters. A penalty factor is also added to the objective function as its last term: when the two sets of network parameters differ greatly, a larger penalty is applied, which controls the update speed of the network. Here KL(θ, θ_k) is the function that measures the difference between the two sets of network parameters, and β is the parameter controlling the update rate of the network.
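A sketch of the KL-penalized PPO surrogate described above. The β value is an assumed starting point (PPO typically adapts it), and the log-probabilities, advantages, and per-step KL divergences are assumed to be computed elsewhere from the sampling network and the current network.

```python
import torch

def ppo_kl_objective(logp_new: torch.Tensor, logp_old: torch.Tensor,
                     advantages: torch.Tensor, kl: torch.Tensor,
                     beta: float = 0.01) -> torch.Tensor:
    """Importance-sampled surrogate with a KL penalty that slows the update when the
    current policy drifts far from the fixed sampling policy."""
    ratio = torch.exp(logp_new - logp_old.detach())   # p_theta'(a_t|s_t) / p_theta(a_t|s_t)
    surrogate = (ratio * advantages.detach()).mean()  # expected return estimated from old samples
    return -(surrogate - beta * kl.mean())            # negated so it can be minimized
```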
When training the Policy network by reinforcement learning, the navigation capability is first trained on its own without the optical flow branch, i.e. using the reward function without its third term; the optical flow action classifier is then introduced, the parameters of the RGB-D branch and the optical flow branch are frozen, and only the parameters of the subsequent action decision network are updated.
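A sketch of this two-stage schedule in code; the attribute names `rgbd_branch`, `flow_branch`, and `decision_head` are assumptions for illustration, not names used in the patent.

```python
def freeze_for_stage_two(policy_net):
    """After navigation has been learned without the swing penalty, freeze the RGB-D
    branch and the optical flow branch (including the flow action classifier) and
    return only the action-decision parameters for the optimizer."""
    for module in (policy_net.rgbd_branch, policy_net.flow_branch):
        for p in module.parameters():
            p.requires_grad = False
    return [p for p in policy_net.decision_head.parameters() if p.requires_grad]
```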
The invention has the following advantages: by setting a new reward in reinforcement learning, introducing optical flow information to judge the robot's action type, combining it with the other observations, and feeding everything into the action decision network, the method predicts the navigation action the robot should currently execute using visual information alone. This avoids the complex and cumbersome pipeline of traditional SLAM methods, reduces the unnecessary swinging actions produced in the robot's decision process, improves navigation efficiency, and at the same time reduces wasted energy.
Drawings
FIG. 1 is an optical flow image generated when a robot performs an action;
fig. 2 is a diagram of a Policy network architecture.
Detailed Description
The following is a clear and complete description of the embodiments of the present invention with reference to the accompanying drawings; obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of the invention.
Referring to fig. 2, the image feature extraction part of the network consists of two branches. The optical flow branch first predicts the corresponding optical flow image from the RGB images using LiteFlowNet, then obtains the action type through the classifier network and converts it into an embedding vector. The RGB-D branch extracts a feature vector through a convolutional network and a fully connected layer. The two vectors are concatenated with the relative target position to form a new vector that serves as the input of the subsequent action policy network, from which the action to execute is finally sampled.
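The sketch below mirrors that two-branch structure; all layer sizes, the two-dimensional relative-target encoding, and the three-action set are assumptions for illustration. The optical flow branch (LiteFlowNet plus the action classifier) is assumed to run outside this module and to hand in the id of the predicted previous action.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """RGB-D convolutional branch + embedding of the action predicted by the optical
    flow branch, concatenated with the relative target and fed to the decision head."""
    def __init__(self, num_actions: int = 3, embed_dim: int = 16):
        super().__init__()
        self.rgbd_branch = nn.Sequential(            # RGB-D (4-channel) feature extractor
            nn.Conv2d(4, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.action_embed = nn.Embedding(num_actions, embed_dim)   # flow-branch action -> vector
        self.decision_head = nn.Sequential(
            nn.Linear(128 + embed_dim + 2, 128), nn.ReLU(),        # +2 for relative target (distance, angle)
            nn.Linear(128, num_actions),
        )

    def forward(self, rgbd, prev_action_id, rel_target):
        feat = self.rgbd_branch(rgbd)
        embed = self.action_embed(prev_action_id)                  # action predicted from optical flow
        logits = self.decision_head(torch.cat([feat, embed, rel_target], dim=1))
        return torch.distributions.Categorical(logits=logits)      # distribution to sample the action from

# Example forward pass on a batch of two observations.
net = PolicyNet()
dist = net(torch.randn(2, 4, 120, 160), torch.tensor([0, 2]), torch.randn(2, 2))
action = dist.sample()
```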
An energy-saving unmanned vehicle path navigation method for minimizing useless actions, comprising the following steps:
step one, predicting the action the robot has just executed from the current optical flow image and the optical flow image from before that action;
step two, setting a new reward function and predicting the action the robot should currently execute to avoid obstacles, based on the robot's action sequence over the previous n steps (i.e., the actions predicted from the optical flow images of the previous time steps) and the current observation, thereby reducing redundant left-right swinging and steering actions during navigation, where n is less than or equal to 10.
Further, step one specifically includes the following steps:
step 1.1, the robot obtains the RGB observation image of the current frame and the RGB observation image of the previous frame;
step 1.2, predicting dense optical flow with LiteFlowNet from the RGB observation images of the current frame and the previous frame;
step 1.3, letting the robot act continuously in a simulation environment and storing each computed optical flow image together with the label of the action actually executed, to generate a training data set;
step 1.4, constructing a neural network classifier, feeding the optical flow images in the data set into it to predict the action type label, and training the action classifier by supervised learning;
step 1.5, inputting the optical flow image obtained after the last executed action and the optical flow image of the previous frame into the classifier to obtain the current action prediction result.
Specifically, referring to fig. 1, the rightmost image is the color legend for the vectors produced when pixels move in different directions. The red image on the left corresponds to the robot turning left, with the pixels moving to the right, and blue corresponds to a right turn. The two left images appear almost uniform in color because the background is stationary and only the camera moves, so the background pixels shift as a whole. The third image shows the effect of moving forward: the pixels move divergently outward, producing a multi-colored image according to the pixel-motion legend. When optical flow is used to predict actions, the robot cannot switch motion states immediately because of inertia; for example, if the robot receives a left-turn command while performing a right turn, it cannot instantly enter the left-turn state, so the resulting optical flow image still shows a right turn. Although the two action states are nominally the same, the degree of rotation differs somewhat and the two optical flow images differ, so the previous optical flow image is also fed in for prediction to reduce the influence of inertia as much as possible.
Further, the classifier consists of several convolution layers and a fully connected layer, and the action type label is the type of action corresponding to the optical flow image.
Further, step two specifically includes the following steps:
step 2.1, at the current time t the robot obtains an observation s_t; from the observed RGB-D image, the Policy network predicts the action a_t that should currently be executed; the robot executes a_t to interact with the environment, the environment feeds back a reward for a_t, and a new observation s_{t+1} is obtained for a new round of action prediction; this process repeats until the robot reaches the given target position, and the Policy network is trained by reinforcement learning so that the predicted actions obtain more reward and the navigation capability is learned;
step 2.2, in the Policy network, adding an optical flow branch that predicts the robot's previous action and appending that action to the past action sequence used to compute the new reward function;
step 2.3, concatenating the RGB-D image feature vector extracted by the RGB-D branch's convolution layers, the embedding vector of the action predicted by the optical flow branch, and the target position information, and feeding the concatenated vector into the action decision network to predict the action that should currently be executed;
step 2.4, continuing to train the Policy network with the new reward-and-penalty function, which relates the currently predicted action to the actions of the recent past, so that the robot reduces redundant rotation and swinging actions.
Further, the new reward function in step two is specified as follows: the currently predicted action receives an additional penalty according to the degree of robot swaying, i.e. the reward is computed as the sum of the following terms: a penalty for each time step; a penalty when the robot collides; an additionally introduced reward term that penalizes the swinging actions made by the robot, where swing_count (the degree of swinging) is obtained from the number of left-right turning actions in the past action sequence; a reward for completing navigation; and a final term, scaled by a reward coefficient, that rewards the robot when it gets closer to the target and penalizes it when it moves away, where D_t is the distance to the target after executing the action and D_{t-1} is the distance to the target before executing the action.
During the interaction between robot and environment, the observations, the corresponding action labels, and the resulting reward sequence are continuously collected and stored in a buffer. After a certain number of steps has been collected, the sequences are taken out and, following the reinforcement learning policy gradient method, the gradient of the objective function J(π_θ) is computed and used to update the Policy network, so that the robot obtains the maximum expected return through the current Policy network. Here the advantage function A evaluates how good it is to execute action a_t in state s_t, i.e. how much reward the action can obtain, and is implemented by a fully connected layer; the log term is the log-probability log π_θ(a_t | s_t) of predicting a_t in state s_t; π_θ is the policy network, θ its parameters, and E the expectation.
To address the low sampling efficiency of the policy gradient, the PPO reinforcement learning method is used during training: observations, the corresponding action sequences, and the reward sequences are sampled with a policy network whose parameters are kept fixed, and the gradient is then computed after adjusting for the difference between the distribution of the current policy network and that of the policy network used for sampling. This makes it possible to estimate the expected return of the current network from samples drawn by the sampling network and improves data utilization; concretely, the objective contains the probability ratio p_θ'(a_t | s_t) / p_θ(a_t | s_t), where θ' denotes the parameters of the current network and p_θ(a_t | s_t) is the probability of selecting a_t in state s_t under the sampling policy network π_θ.
After several iterations of updating, the parameters of the sampling network are refreshed with the current policy network parameters. A penalty factor is also added to the objective function as its last term: when the two sets of network parameters differ greatly, a larger penalty is applied, which controls the update speed of the network. Here KL(θ, θ_k) is the function that measures the difference between the two sets of network parameters, and β is the parameter controlling the update rate of the network.
When training the Policy network by reinforcement learning, the navigation capability is first trained on its own without the optical flow branch, i.e. using the reward function without its third term; the optical flow action classifier is then introduced, the parameters of the RGB-D branch and the optical flow branch are frozen, and only the parameters of the subsequent action decision network are updated.

Claims (3)

1. An energy-saving unmanned vehicle path navigation method for minimizing useless actions is characterized by comprising the following steps:
step one, predicting the action the robot has just executed from the current optical flow image and the optical flow image from before that action;
step two, setting a new reward function and predicting the action the robot should currently execute to avoid obstacles, based on the robot's action sequence over the previous n steps, i.e. the actions predicted from the optical flow images of the previous time steps, and the current observation, thereby reducing redundant left-right swinging and steering actions during navigation, wherein n is less than or equal to 10,
step one specifically comprises the following steps:
step 1.1, the robot obtains the RGB observation image of the current frame and the RGB observation image of the previous frame;
step 1.2, predicting dense optical flow with LiteFlowNet from the RGB observation images of the current frame and the previous frame;
step 1.3, letting the robot act continuously in a simulation environment and storing each computed optical flow image together with the label of the action actually executed, to generate a training data set;
step 1.4, constructing a neural network classifier, feeding the optical flow images in the data set into it to predict the action type label, and training the action classifier by supervised learning;
step 1.5, inputting the optical flow image obtained after the last executed action and the optical flow image of the previous frame into the classifier to obtain the current action prediction result;
step two specifically comprises the following steps:
step 2.1, at the current time t the robot obtains an observation s_t; from the observed RGB-D image, the Policy network predicts the action a_t that should currently be executed; the robot executes a_t to interact with the environment, the environment feeds back a reward for a_t, and a new observation s_{t+1} is obtained for a new round of action prediction; this process repeats until the robot reaches the given target position, and the Policy network is trained by reinforcement learning so that the predicted actions obtain more reward and the navigation capability is learned;
step 2.2, in the Policy network, adding an optical flow branch that predicts the robot's previous action and appending that action to the past action sequence used to compute the new reward function;
step 2.3, concatenating the RGB-D image feature vector extracted by the RGB-D branch's convolution layers, the embedding vector of the action predicted by the optical flow branch, and the target position information, and feeding the concatenated vector into the action decision network to predict the action that should currently be executed;
step 2.4, continuing to train the Policy network with the new reward-and-penalty function, which relates the currently predicted action to the actions of the recent past, so that the robot reduces redundant rotation and swinging actions.
2. The energy-saving unmanned vehicle path navigation method for minimizing useless actions according to claim 1, wherein the classifier is composed of a plurality of convolution layers and a fully connected layer, and the action type label is the type of action corresponding to the optical flow image.
3. The energy-saving unmanned vehicle path navigation method for minimizing useless actions according to claim 1, wherein the new reward function in step two is specified as follows: the currently predicted action receives an additional penalty according to the degree of robot swaying, i.e. the reward is computed as the sum of the following terms: a penalty for each time step; a penalty when the robot collides; an additionally introduced reward term that penalizes the swinging actions made by the robot, where swing_count, the degree of swinging, is obtained from the number of left-right turning actions in the past action sequence; a reward for completing navigation; and a final term, scaled by a reward coefficient, that rewards the robot when it gets closer to the target and penalizes it when it moves away, where D_t is the distance to the target after executing the action and D_{t-1} is the distance to the target before executing the action,
during the interaction between robot and environment, the observations, the corresponding action labels, and the resulting reward sequence are continuously collected and stored in a buffer; after a certain number of steps has been collected, the sequences are taken out and, following the reinforcement learning policy gradient method, the gradient of the objective function J(π_θ) is computed and used to update the Policy network, so that the robot obtains the maximum expected return through the current Policy network, wherein the advantage function A evaluates how good it is to execute action a_t in state s_t, i.e. how much reward the action can obtain, and is implemented by a fully connected layer, the log term is the log-probability log π_θ(a_t | s_t) of predicting a_t in state s_t, π_θ is the policy network, θ its parameters, and E the expectation,
to address the low sampling efficiency of the policy gradient, the PPO reinforcement learning method is used during training: observations, the corresponding action sequences, and the reward sequences are sampled with a policy network whose parameters are kept fixed, and the gradient is then computed after adjusting for the difference between the distribution of the current policy network and that of the policy network used for sampling, so that the expected return of the current network is estimated from samples drawn by the sampling network and data utilization is improved, the objective containing the probability ratio p_θ'(a_t | s_t) / p_θ(a_t | s_t), where θ' denotes the parameters of the current network and p_θ(a_t | s_t) is the probability of selecting a_t in state s_t under the sampling policy network π_θ,
after several iterations of updating, the parameters of the sampling network are refreshed with the current policy network parameters, and a penalty factor is added to the objective function as its last term: when the two sets of network parameters differ greatly, a larger penalty is applied, controlling the update speed of the network, wherein KL(θ, θ_k) is the function measuring the difference between the two sets of network parameters and β is the parameter controlling the update rate of the network,
when training the Policy network by reinforcement learning, the navigation capability is first trained on its own without the optical flow branch, i.e. using the reward function without its third term; the optical flow action classifier is then introduced, the parameters of the RGB-D branch and the optical flow branch are frozen, and only the parameters of the subsequent action decision network are updated.
CN202110217057.7A 2021-02-26 2021-02-26 Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions Active CN112857373B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110217057.7A CN112857373B (en) 2021-02-26 2021-02-26 Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110217057.7A CN112857373B (en) 2021-02-26 2021-02-26 Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions

Publications (2)

Publication Number Publication Date
CN112857373A CN112857373A (en) 2021-05-28
CN112857373B true CN112857373B (en) 2024-02-20

Family

ID=75990127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110217057.7A Active CN112857373B (en) 2021-02-26 2021-02-26 Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions

Country Status (1)

Country Link
CN (1) CN112857373B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113776537B (en) * 2021-09-07 2024-01-19 山东大学 De-centralized multi-agent navigation method and system in unmarked complex scene

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852273A (en) * 2019-11-12 2020-02-28 重庆大学 Behavior identification method based on reinforcement learning attention mechanism
CN111141300A (en) * 2019-12-18 2020-05-12 南京理工大学 Intelligent mobile platform map-free autonomous navigation method based on deep reinforcement learning
CN111780777A (en) * 2020-07-13 2020-10-16 江苏中科智能制造研究院有限公司 Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852273A (en) * 2019-11-12 2020-02-28 重庆大学 Behavior identification method based on reinforcement learning attention mechanism
CN111141300A (en) * 2019-12-18 2020-05-12 南京理工大学 Intelligent mobile platform map-free autonomous navigation method based on deep reinforcement learning
CN111780777A (en) * 2020-07-13 2020-10-16 江苏中科智能制造研究院有限公司 Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Design of a navigation system for an unmanned transport vehicle based on visual SLAM; 吕海泳; 蔡建宁; 袁贺男; Shandong Industrial Technology (No. 03); full text *
Research on pose estimation and obstacle avoidance methods for indoor mobile robots. China Master's Theses Full-text Database. 2022, full text. *

Also Published As

Publication number Publication date
CN112857373A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
Ruan et al. Mobile robot navigation based on deep reinforcement learning
Gandhi et al. Learning to fly by crashing
US11561544B2 (en) Indoor monocular navigation method based on cross-sensor transfer learning and system thereof
CN110874578B (en) Unmanned aerial vehicle visual angle vehicle recognition tracking method based on reinforcement learning
CN110632931A (en) Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN116263335A (en) Indoor navigation method based on vision and radar information fusion and reinforcement learning
CN115860107B (en) Multi-machine searching method and system based on multi-agent deep reinforcement learning
Saksena et al. Towards behavioural cloning for autonomous driving
CN111580526B (en) Cooperative driving method for fixed vehicle formation scene
CN112857370A (en) Robot map-free navigation method based on time sequence information modeling
CN112857373B (en) Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions
CN115877869A (en) Unmanned aerial vehicle path planning method and system
Guo et al. A deep reinforcement learning based approach for AGVs path planning
CN115265547A (en) Robot active navigation method based on reinforcement learning in unknown environment
Liu et al. Multi-agent trajectory prediction with graph attention isomorphism neural network
Ma et al. Using RGB image as visual input for mapless robot navigation
Li A hierarchical autonomous driving framework combining reinforcement learning and imitation learning
CN117475279A (en) Reinforced learning navigation method based on target drive
Ejaz et al. Autonomous visual navigation using deep reinforcement learning: An overview
CN117553798A (en) Safe navigation method, equipment and medium for mobile robot in complex crowd scene
CN116734850A (en) Unmanned platform reinforcement learning autonomous navigation system and method based on visual input
CN116679710A (en) Robot obstacle avoidance strategy training and deployment method based on multitask learning
US20240054008A1 (en) Apparatus and method for performing a task
Zhang et al. Visual navigation of mobile robots in complex environments based on distributed deep reinforcement learning
Li et al. End-to-end autonomous exploration for mobile robots in unknown environments through deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant