CN108983804A - Biped robot gait planning method based on deep reinforcement learning - Google Patents

Biped robot gait planning method based on deep reinforcement learning

Info

Publication number
CN108983804A
Authority
CN
China
Prior art keywords
gait
robot
data
feature
noise reduction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810979187.2A
Other languages
Chinese (zh)
Other versions
CN108983804B (en)
Inventor
吴晓光
刘绍维
杨磊
张天赐
李艳会
王挺进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanshan University
Original Assignee
Yanshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yanshan University filed Critical Yanshan University
Priority to CN201810979187.2A priority Critical patent/CN108983804B/en
Publication of CN108983804A publication Critical patent/CN108983804A/en
Application granted granted Critical
Publication of CN108983804B publication Critical patent/CN108983804B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G05 — CONTROLLING; REGULATING
    • G05D — SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 — Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/08 — Control of attitude, i.e. control of roll, pitch, or yaw
    • G05D1/0891 — Control of attitude, i.e. control of roll, pitch, or yaw specially adapted for land vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a biped robot gait planning method based on deep reinforcement learning. Exploiting the stability and compliance of human gait, it combines deep reinforcement learning with human motion data to control the gait of a biped robot effectively, and comprises the following steps: 1) establish a passive biped robot model; 2) acquire and process human gait data and target gait data; 3) use denoising autoencoders to extract the hidden features of the biped robot gait data and the human gait data respectively; 4) learn the human gait features with deep reinforcement learning and thereby plan the biped robot's gait. By combining deep reinforcement learning with human gait data, the invention controls the biped robot to walk as stably and compliantly as a human.

Description

Biped robot gait planning method based on deep reinforcement learning
Technical field
The present invention relates to the technical field of biped robots, and in particular to a biped robot gait planning method based on deep reinforcement learning.
Background technique
At present, the locomotion modes of mobile robots include tracked, wheeled and biped locomotion. Compared with tracked and wheeled robots, biped robots are more adaptable: they can move not only on flat ground but also in irregular environments (climbing stairs, walking on uneven ground, etc.). However, a biped robot is itself a highly nonlinear hybrid dynamical system, and gait planning has always been a difficult problem.
Gait planning for a biped robot must consider not only the stability of walking but also the efficiency, compliance and environmental adaptability of the walking motion. Methods based on simplified models are commonly used: starting from the kinematics and dynamics of biped walking, they reduce the main features of the robot to basic models such as the inverted pendulum, the two-link model and the four-bar linkage, and then plan the gait of the biped robot on the basis of these simplified models. Because they ignore part of the robot's physical characteristics, simplified-model methods leave the biped robot with weak disturbance rejection, high sensitivity to the environment and a single, inflexible gait. Gait planning methods based on intelligent algorithms, including neural networks, support vector machines, fuzzy control and reinforcement learning, have become a research hotspot because they are learnable, adaptive and fault-tolerant. In general, however, an intelligent algorithm can guarantee stable walking but cannot guarantee an efficient and compliant gait at the same time, and sometimes it even produces a stiff, unnatural gait.
Summary of the invention
The purpose of the present invention is to solve the above problems by providing a biped robot gait planning method based on deep reinforcement learning. The invention exploits the structural and kinematic similarity between a kneed biped robot model and the human body during walking, and combines it with data-driven deep reinforcement learning, thereby overcoming the weak disturbance rejection of model-based gait planning methods and the stiff gaits of conventional intelligent gait planning methods, and improving the stability and compliance of the robot's walking.
To achieve the above object, the present invention is realized by the following technical solution:
A biped robot gait planning method based on deep reinforcement learning, comprising the following steps:
Step S1: establish a biped robot model and describe the robot walking process;
Step S2: acquire and process human gait data and target gait data;
Step S3: use denoising autoencoders to extract the hidden features of the biped robot gait data and the human gait data respectively;
Step S4: learn the human gait features with a deep reinforcement learning method, and thereby plan the gait of the biped robot.
In the above technical solution, step S1 specifically includes the following steps:
Step S101: establish a four-link kneed biped robot model with arc feet. The model comprises two thighs, two shanks and two arc feet; the legs are rigid rods connected by frictionless hinges, and the arc feet are fixed to the shanks. The stance leg and the swing leg have identical mass and geometric parameters, and the mass of each leg is uniformly distributed. A limit mechanism is placed at each knee joint of the model to imitate the function of the human knee, and two motors are placed at the hip joint to apply control torques to the stance leg and the swing leg respectively;
Step S102: take the right side of the direction of advance during walking as the viewpoint, analyse the gait process of the model, and select dimensionless physical quantities that characterise the robot state in real time. The selected quantities are defined as the robot walking state Θ_r, which consists of the following angles and angular velocities:
Counter-clockwise rotation is taken as positive; θ_r1 and θ̇_r1 are the angle and angular velocity of the swing-leg shank with respect to the vertical; θ_r2 and θ̇_r2 are the angle and angular velocity of the swing-leg thigh with respect to the vertical; θ_r3 and θ̇_r3 are the angle and angular velocity of the stance-leg shank with respect to the vertical.
In the above technical solution, step S2 specifically includes the following steps:
Step S201: define one gait cycle, for both the human and the robot, as the process from the start of the swing of the swing leg until the swing leg collides with the ground;
Step S202: choose a data set of normal human walking from the CMU human motion capture database, segment it by body parts and resolve it, obtaining a description of the human walking process;
Step S203: taking the robot model as reference, take the 2D sagittal plane of human walking and define the human walking state as Θ_m; express all data in the human walking process description as Θ_m, and combine the Θ_m row vectors to obtain the human gait data Θ_M;
Step S204: from the human gait data Θ_M, choose one gait cycle as the robot's learning object, extract the odd-numbered frames of the learning-object data to form a new data set, and define it as the target gait data Θ_S; any row vector of Θ_S is an extracted Θ_m;
Step S205: sample the robot walking state Θ_r within one gait cycle at the sampling frequency of Θ_S to form the robot gait data Θ_R; any row vector of Θ_R is a sampled Θ_r.
In the above technical solution, step S3 specifically includes: according to the data structure of Θ_r and Θ_m, construct two denoising autoencoders of identical structure and perform feature extraction on the robot gait data Θ_R and the target gait data Θ_S. Feed the row vectors of Θ_R and Θ_S into the denoising autoencoders one by one, arrange the resulting features in their original order to form the robot gait feature data H_R and the target gait feature data H_S, and normalise H_R and H_S uniformly for use in deep reinforcement learning. Each denoising autoencoder works through the following steps:
S301: take one row vector Θ of Θ_R or Θ_S and feed it into the denoising autoencoder. The autoencoder uses a binomial distribution to selectively erase elements of the original gait data Θ, setting the erased entries to 0 and obtaining the corrupted gait data Θ̃. The encoding function f maps Θ̃ to the hidden layer, giving the hidden feature h = s_f(wΘ̃ + p),
wherein w is the weight matrix between the input layer and the hidden layer and s_f is the activation function of the encoding function f, taken as the sigmoid function;
S302: the decoding function g maps the hidden feature h to the output layer, giving the reconstruction output y = s_g(w̃h + q). The reconstruction output y retains the information of the original gait data, and the overall error is expressed by the overall loss function J_DAE of the denoising autoencoder on the given training set, J_DAE(θ_DAE) = Σ L(Θ, y),
wherein w̃ is the weight matrix between the hidden layer and the output layer and satisfies w̃ = wᵀ, and s_g is the activation function of the decoding function, likewise the sigmoid function;
wherein θ_DAE denotes the parameters of the denoising autoencoder, namely w, p and q, and L is the reconstruction error, which measures how close y is to Θ
over the n dimensions of the input and output layers;
S303: during training, the denoising autoencoder uses gradient descent to iteratively minimise J_DAE(θ); the update of θ_DAE is θ_DAE ← θ_DAE − α·∂J_DAE/∂θ_DAE,
wherein α is the learning rate, with a value in [0, 1].
In the above technical solution, in step S4 the deep deterministic policy gradient algorithm DDPG is selected as the learning algorithm of the biped robot. The robot gait feature data H_R produced by the denoising autoencoder is used as the input data s_t of the DDPG algorithm, the target gait feature data H_S is used as the basis for computing the reward r_t, and the DDPG algorithm outputs the torques a_t to be executed by the motors. The robot collects data of different gaits during continuous walking and supplies them to the training of the DDPG algorithm, so that the algorithm finally acquires the ability to control the robot to reach the target gait.
In the above technical solution, the policy network of the DDPG algorithm is a five-layer convolutional neural network consisting of an input layer, two convolutional layers, a fully connected layer and an output layer, wherein the input layer receives s_t and the output layer outputs the torques a_t to be executed by the motors.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention combines deep reinforcement learning with human gait data, overcoming the weak disturbance rejection of model-based gait planning methods and the stiff gaits of conventional intelligent gait planning methods. The denoising autoencoders both extract the features of the gait data and eliminate the influence of model differences and noise. Compared with conventional reinforcement learning, DDPG solves more complex problems in less time and meets higher control requirements. Using the target gait feature data H_S as the basis for computing r_t allows DDPG to exploit the human gait data effectively, so that r_t evaluates both the stability and the compliance of the robot gait. After training, DDPG can finally control the robot to walk as stably and compliantly as a human.
Detailed description of the invention
In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the planning method of the invention;
Fig. 2 is a schematic diagram of the four-link kneed arc-foot biped robot model;
Fig. 3 is a schematic diagram of the biped robot walking;
Fig. 4 is a schematic diagram of the 2D human walking process obtained from the human motion database;
Fig. 5 is a schematic diagram of the operation of the denoising autoencoder DAE;
Fig. 6 is a structural diagram of the deep deterministic policy gradient algorithm DDPG;
Fig. 7 is a flow chart of the training of the deep deterministic policy gradient algorithm DDPG.
Specific embodiment
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all of them.
Fig. 1 is a schematic diagram of the planning method of the invention. As shown in Fig. 1, the biped robot gait planning method based on deep reinforcement learning of the invention comprises:
Step S1: establish a biped robot model and describe the robot walking process. Step S1 specifically includes the following steps:
Step S101: establish a four-link kneed biped robot model with arc feet;
Step S102: take the right side of the direction of advance during walking as the viewpoint, analyse the gait process of the model, select dimensionless physical quantities that characterise the robot state in real time, and define the selected quantities as the robot walking state Θ_r.
Specifically, the biped robot model established in this embodiment is a four-link kneed biped robot model with arc feet, as shown in Fig. 2. The robot consists of two thighs, two shanks and two arc feet. The legs are rigid rods connected by frictionless hinges, and the arc feet are fixed to the shanks. The stance leg and the swing leg of the model have identical mass and geometric parameters, and the mass of each leg is uniformly distributed. A limit mechanism is placed at each knee joint to imitate the function of the human knee. Two motors are placed at the hip joint and can apply control torques to the stance leg and the swing leg respectively.
This embodiment models only the 2D motion seen from the right side of the direction of advance. One walking step of the robot is shown in Fig. 3 and can be described as:
Stage I: the knee joint of the stance leg locks and the stance leg performs an inverted-pendulum motion, with no relative sliding or displacement between the stance leg and the ground; the knee joint of the swing leg is released, the swing leg swings forward and the hip joint moves forward.
Stage II: the swing leg swings past the stance leg; when the swing leg reaches its maximum flexion, its thigh and shank collide at the limit mechanism. This collision is completed instantaneously, after which the limit mechanism locks and remains locked.
Stage III: the swing leg swings backward relative to the stance leg while the hip joint still moves forward.
Stage IV: the swing leg strikes the ground; the collision is instantaneous and there is no rebound. The stance leg and the swing leg then exchange roles.
During the whole walking process, the robot walking state can be described in real time by the following angles and angular velocities:
Counter-clockwise rotation is taken as positive; θ_r1 and θ̇_r1 are the angle and angular velocity of the swing-leg shank with respect to the vertical; θ_r2 and θ̇_r2 are the angle and angular velocity of the swing-leg thigh with respect to the vertical; θ_r3 and θ̇_r3 are the angle and angular velocity of the stance-leg shank with respect to the vertical.
Step S2: acquire and process human gait data and target gait data.
Step S2 specifically includes the following steps:
Step S201: define one gait cycle, for both the human and the robot, as the process from the start of the swing of the swing leg until the swing leg collides with the ground;
Step S202: choose a data set of normal human walking from the CMU human motion capture database, segment it by body parts and resolve it, obtaining a description of the human walking process;
Step S203: taking the robot model as reference, take the 2D sagittal plane of human walking and define the human walking state as Θ_m. Express all data in the human walking process description as Θ_m, and combine the Θ_m row vectors to obtain the human gait data Θ_M;
Step S204: from the human gait data Θ_M, choose one gait cycle as the robot's learning object, extract the odd-numbered frames of the learning-object data to form a new data set, and define it as the target gait data Θ_S;
Step S205: sample the robot walking state Θ_r within one gait cycle at the sampling frequency of Θ_S to form the robot gait data Θ_R.
Specifically, in this embodiment, in order for the biped robot to learn human gait, human motion capture technology is used to provide the target gait data for the robot. The quality of the gait data directly affects the robot's final learning result, so its reliability is particularly important in this embodiment. Reliable gait data can be obtained from well-known human motion capture databases; the open-source motion capture data provided by these databases have been used by many researchers and have high accuracy and reliability.
In this embodiment, the open-source human motion capture database of the CMU Graphics Lab at Carnegie Mellon University is used. The laboratory records human motion with 12 infrared cameras at 120 Hz in a rectangular room of 3 m × 8 m and produces standard data files. From the data selected from the database, the human body in the gait data can be divided into 16 segments according to adult body inertial parameter standards; spurious noise is then removed by filtering, and the density, inertia tensor, moment of inertia and centroid position of each limb segment are derived from multiple regression equations of human physiological structure.
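By way of illustration only, the following sketch shows one common way to filter 120 Hz motion-capture joint-angle trajectories before limb-segment parameters are computed. The patent does not name the filter; the zero-phase Butterworth low-pass, its order and its cutoff frequency are assumptions, and the function name is hypothetical.

```python
# Hypothetical filtering step for CMU joint-angle data (frames x channels at 120 Hz).
import numpy as np
from scipy.signal import butter, filtfilt

def filter_gait_channels(angles: np.ndarray, fs: float = 120.0, cutoff: float = 6.0) -> np.ndarray:
    """Zero-phase low-pass filter each joint-angle channel to remove spurious noise."""
    b, a = butter(N=4, Wn=cutoff / (fs / 2.0), btype="low")
    return filtfilt(b, a, angles, axis=0)
```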
Human walking involves a large number of joint degrees of freedom; even after these degrees of freedom are partitioned, their number is still too large for the robot. In order to make the robot and the human consistent in terms of gait data, the joint degrees of freedom involved in human walking need to be simplified.
Taking the kneed biped robot model as reference and taking the 2D sagittal plane of human walking, the human walking state is defined by the following angles and angular velocities:
Counter-clockwise rotation is taken as positive; θ_m1 and θ̇_m1 are the angle and angular velocity of the swing-leg shank with respect to the vertical; θ_m2 and θ̇_m2 are the angle and angular velocity of the swing-leg thigh with respect to the vertical; θ_m3 and θ̇_m3 are the angle and angular velocity of the stance-leg shank with respect to the vertical.
A data set of normal human walking is chosen from the CMU motion capture database, segmented by body parts and resolved, giving the 2D human walking process shown in Fig. 4. In this embodiment the gait process data obtained from the data set are simplified according to the definition of the human walking state, producing the final human gait data Θ_M, any row vector of which is an extracted Θ_m.
In this embodiment, one gait cycle, for both the human and the robot, is the process from the start of the swing of the swing leg until the swing leg collides with the ground. One gait cycle is chosen from the human gait data Θ_M as the robot's learning object; considering the time required for the motor torques to change, the odd-numbered frames of the learning-object data are extracted to form a new data set, defined as the target gait data Θ_S. The robot gait data within one gait cycle are sampled at the sampling frequency of Θ_S to form the robot gait data Θ_R, any row vector of which is a sampled Θ_r. If the dimensions of Θ_R and Θ_S differ, a resize operation is applied to make them identical.
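A minimal sketch of the frame selection and dimension matching described above is given below. The patent does not specify how the "resize" operation is implemented; linear interpolation along the time axis is one possible realisation, and the function names are illustrative.

```python
# Sketch: build the target gait data from odd-numbered frames and resize Θ_R to match Θ_S.
import numpy as np

def target_gait_from_cycle(theta_m_cycle: np.ndarray) -> np.ndarray:
    """Keep the odd-numbered frames of one human gait cycle as the target gait data Θ_S."""
    return theta_m_cycle[::2]

def resize_to_match(theta_r: np.ndarray, n_target: int) -> np.ndarray:
    """Resample the robot gait data Θ_R so it has the same number of rows as Θ_S."""
    old_t = np.linspace(0.0, 1.0, len(theta_r))
    new_t = np.linspace(0.0, 1.0, n_target)
    return np.stack(
        [np.interp(new_t, old_t, theta_r[:, j]) for j in range(theta_r.shape[1])],
        axis=1,
    )
```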
Step S3: use denoising autoencoders to extract the hidden features of the biped robot gait data and the human gait data respectively. Step S3 specifically includes: according to the data structure of Θ_r and Θ_m, construct two denoising autoencoders of identical structure and perform feature extraction on the robot gait data Θ_R and the target gait data Θ_S. Feed the row vectors of Θ_R and Θ_S into the denoising autoencoders one by one, arrange the resulting features in their original order to form the robot gait feature data H_R and the target gait feature data H_S, and normalise H_R and H_S uniformly for use in deep reinforcement learning.
Because the geometric parameters of the human and the robot differ, and considering the generality of the invention and the noise present in the gait data, this embodiment further processes the robot and human gait data with the denoising autoencoder DAE, so as to extract and encode more robust features from the available gait data while eliminating the influence of model parameters and noise, allowing the robot to learn the human gait better.
The DAE is an improved algorithm based on the autoencoder; it has a simple structure and fast computation, is commonly used for the preliminary processing of data for deep learning networks, and can extract and encode more robust features from known data while eliminating possible noise.
The DAE used in this embodiment is a single-hidden-layer neural network composed of three layers. The first layer is the input layer, which receives the original gait data and adds noise to obtain the corrupted data. The second layer is the hidden layer, in which the DAE encodes the corrupted data; the encoding result can be regarded as the hidden features of the original gait. The third layer is the output layer, in which the hidden features are decoded and reconstructed; after training, the reconstruction output of the DAE should be identical to the original gait data. The DAE updates its network parameters by gradient descent.
The DAE adjusts its network parameters by training until the loss between the original input x and the reconstruction output y is very small; the hidden-layer output can then be regarded as a representation of the original input x, called the features of x, and these features serve as a good representation of the original input signal. By adding noise to the training data, the DAE forces the hidden layer to learn to remove the noise while still expressing the original gait information completely, which compels the DAE to learn a more robust representation of the input signal. The DAE workflow is shown in Fig. 5; taking the robot gait data Θ_R as an example, it can be described as:
S301: take one row vector Θ_r of Θ_R and feed it into the DAE. The DAE uses a binomial distribution to selectively erase elements of the original gait data Θ_r, setting the erased entries to 0 and obtaining the corrupted gait data Θ̃_r. The encoding function f maps Θ̃_r to the hidden layer, giving the hidden feature h = s_f(wΘ̃_r + p),
wherein w is the weight matrix between the input layer and the hidden layer and s_f is the activation function of the encoding function f, taken as the sigmoid function;
S302: the decoding function g maps the hidden feature h to the output layer, giving the reconstruction output y = s_g(w̃h + q). The reconstruction output y retains the information of the original gait data, which guarantees that the hidden feature h characterises the original gait data; the overall error of the reconstruction output y is expressed by the overall loss function J_DAE,
wherein w̃ is the weight matrix between the hidden layer and the output layer and satisfies w̃ = wᵀ, and s_g is the activation function of the decoding function, likewise the sigmoid function. The overall loss function of the DAE on the given training set is J_DAE(θ_DAE) = Σ L(Θ_r, y),
wherein θ_DAE denotes the parameters of the DAE, namely w, p and q, and L is the reconstruction error, which measures how close y is to Θ_r
over the n dimensions of the input and output layers;
S303: during training, the DAE uses gradient descent to iteratively minimise J_DAE(θ); the gradient-descent update of θ_DAE is θ_DAE ← θ_DAE − α·∂J_DAE/∂θ_DAE,
wherein α is the learning rate, with a value in [0, 1].
This embodiment constructs two DAE networks of identical structure, DAE_R and DAE_M, trained with the robot gait data Θ_R and the human gait data Θ_M respectively. After training on a large amount of data, DAE_R and DAE_M extract the hidden features of the robot and human gait data described in this embodiment; the extracted robot and human gait features are denoted h_r and h_m. DAE_M is used to extract the features of each row vector of Θ_S, and the extracted features are arranged in their original order to obtain the target gait feature data H_S. Θ_R is handled in the same way to obtain the robot gait feature data H_R. H_S and H_R are uniformly normalised and supplied to the deep reinforcement learning stage. H_S and H_R characterise the robot and human gait data effectively and greatly reduce the influence of noise and of geometric parameter differences on the deep reinforcement learning.
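For illustration, the following numpy sketch implements a DAE of the kind described above: sigmoid encoder and decoder, tied weights (w̃ = wᵀ), binomial masking noise, a squared-error reconstruction loss and plain gradient descent. The hidden size, erase probability and learning rate are arbitrary choices, not values taken from the patent.

```python
# Minimal denoising autoencoder sketch (single hidden layer, tied weights).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DAE:
    def __init__(self, n_in, n_hidden, erase_prob=0.3, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(0.0, 0.1, size=(n_hidden, n_in))  # encoder weight matrix w
        self.p = np.zeros(n_hidden)                            # hidden-layer bias p
        self.q = np.zeros(n_in)                                # output-layer bias q
        self.erase_prob, self.lr, self.rng = erase_prob, lr, rng

    def encode(self, x):
        return sigmoid(self.w @ x + self.p)                    # h = s_f(w x + p)

    def decode(self, h):
        return sigmoid(self.w.T @ h + self.q)                  # y = s_g(w^T h + q)

    def train_step(self, x):
        mask = self.rng.binomial(1, 1.0 - self.erase_prob, size=x.shape)
        x_tilde = x * mask                                     # selectively erase entries (set to 0)
        h = self.encode(x_tilde)
        y = self.decode(h)
        err = y - x                                            # gradient of sum((y - x)^2), up to a factor 2
        dy = err * y * (1.0 - y)                               # back through decoder sigmoid
        dh = (self.w @ dy) * h * (1.0 - h)                     # back through encoder sigmoid
        grad_w = np.outer(dh, x_tilde) + np.outer(dy, h).T     # encoder + tied-decoder contributions
        self.w -= self.lr * grad_w
        self.p -= self.lr * dh
        self.q -= self.lr * dy
        return float(np.sum(err ** 2))                         # reconstruction error for monitoring
```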
Step S4: learn the human gait features with a deep reinforcement learning method, and thereby plan the gait of the biped robot. Reinforcement learning is a main branch of machine learning: during the interaction between an agent and its environment it gradually improves the agent's action selection, finally enabling the agent to accomplish its goal. Reinforcement learning does not require an accurate model of the agent and is therefore well suited to controlling a biped robot. However, conventional reinforcement learning converges slowly, and although combining it with neural networks improves the learning speed, the samples collected during interaction are highly correlated in time and do not satisfy the independence requirement of neural network training, so the networks easily overfit. With the rapid development of deep learning, deep reinforcement learning has come into researchers' view. Deep reinforcement learning combines conventional reinforcement learning with deep learning; deep learning theory compensates for the deficiencies of reinforcement learning and improves it in every respect.
Because the walking motion of the biped robot is continuous and the action space of the hip-joint drive motors is continuous, this embodiment selects the deep deterministic policy gradient algorithm DDPG as the robot's learning algorithm. DDPG is an Actor-Critic algorithm improved from the deterministic policy gradient DPG; it replaces the policy function and the value function of conventional reinforcement learning with neural networks, referred to as the policy network μ and the Q network Q. The policy network receives the robot state and returns motor torques, while the Q network evaluates the choice made by the policy network from the robot state and the motor torques. The DDPG framework is shown in Fig. 6.
In step S4, the robot gait feature data H_R produced by the denoising autoencoder is used as the input data s_t of DDPG, the target gait feature data H_S is used as the basis for computing the reward r_t, and DDPG outputs the torques a_t executed by the motors. The robot collects data of different gaits during continuous walking and supplies them to the training of DDPG, so that DDPG finally acquires the ability to control the robot to reach the target gait.
To overcome the network oscillation and overfitting caused by the strong temporal correlation of the samples collected during interaction, this embodiment provides DDPG with a memory pool. For each step within a gait cycle, the memory pool stores the robot state s_t, the motor torques a_t chosen for execution, the obtained reward r_t and the robot state s_{t+1} after the motors act, as one experience tuple (s_t, a_t, r_t, s_{t+1}). When the neural networks need to be trained, n experience tuples are drawn at random from the memory pool as training data, where n is generally the minibatch size. This random sampling mechanism breaks the temporal correlation between samples, preventing network oscillation and overfitting, and allows the robot to learn from past experience as well as current experience.
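An illustrative sketch of such a memory pool (experience replay buffer) is shown below. The class and method names are not from the patent; the capacity E and the minibatch size come from the training hyperparameters described later.

```python
# Sketch of the memory pool used to break temporal correlation between samples.
import random
from collections import deque

class MemoryPool:
    def __init__(self, capacity_E: int):
        self.buffer = deque(maxlen=capacity_E)     # oldest experiences are discarded first

    def store(self, s_t, a_t, r_t, s_next):
        self.buffer.append((s_t, a_t, r_t, s_next))

    def sample(self, minibatch: int):
        # random sampling breaks the temporal correlation between consecutive steps
        return random.sample(self.buffer, minibatch)

    def __len__(self):
        return len(self.buffer)
```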
If only a single Q network were used in DDPG to evaluate and train the policy network, the learning process would be unstable, because the parameters of that single Q network would be updated frequently while at the same time being used to compute the gradients of both the Q network and the policy network. Therefore, in this embodiment the policy network μ and the Q network Q are duplicated after their parameters are initialised; the copies are called the offline policy network μ′ and the offline Q network Q′, while the original networks are called the online policy network μ and the online Q network Q. The online networks output the robot's actions and are used while the robot walks, whereas the main function of the offline networks is to provide data support for the training of the online networks, making the whole network more stable and faster to converge.
The online and offline networks have exactly the same structure; they differ only in how their parameters are updated. The parameters of the online networks are updated by stochastic gradient descent using the experience data drawn at random from the memory pool and the outputs of the offline networks. The parameters of the offline networks are updated by a soft update, which moves them toward the parameters of the online networks. Taking the online policy network and the offline policy network as an example, the soft update can be expressed as:
θ^{μ′} ← τθ^{μ} + (1 − τ)θ^{μ′}
wherein θ^μ and θ^{μ′} are the parameters of the online and offline policy networks respectively, and τ generally takes the value 0.001. The soft update between the online Q network and the offline Q network takes the same form.
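A minimal sketch of this soft update, written against PyTorch parameter lists, is shown below; the use of PyTorch and the function name are assumptions for illustration only.

```python
# Soft update θ' ← τ θ + (1 − τ) θ', applied to both the policy and Q network pairs.
import torch

def soft_update(online_net: torch.nn.Module, offline_net: torch.nn.Module, tau: float = 0.001):
    with torch.no_grad():
        for p_online, p_offline in zip(online_net.parameters(), offline_net.parameters()):
            p_offline.mul_(1.0 - tau).add_(tau * p_online)
```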
The training process in this example is shown in Fig. 7 and can be described as follows:
The robot's sampling frequency and the moment at which sampling of the swing leg begins are kept consistent with the target gait data. The robot state at time t is Θ_r; the robot gait feature data H_R obtained by DAE processing is used as s_t, and the target gait feature data H_S is used as the basis for r_t.
S401: the policy network of DDPG in this embodiment is a five-layer CNN. The first layer is the input layer, receiving s_t; the second and third layers are convolutional layers; the fourth layer is a fully connected layer; the fifth layer is the output layer, which bounds the maximum action and outputs the torques to be executed by the motors. The structure of the Q network is roughly the same as that of the policy network, except that the number of input units is increased to accommodate the motor torques a_t and the output layer has a single unit that returns only the evaluation.
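The sketch below illustrates, in PyTorch, one way such a five-layer policy network and the corresponding Q network could be laid out. The channel counts, kernel sizes, hidden width and the action bound max_torque are assumptions; only the layer sequence (input, two convolutional layers, fully connected layer, output) and the Q-network modifications follow the text.

```python
# Illustrative policy and Q network layouts for S401 (assumed sizes, not the patent's values).
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, state_len: int, action_dim: int, max_torque: float = 1.0):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(32 * state_len, 64)
        self.out = nn.Linear(64, action_dim)
        self.max_torque = max_torque                      # action bound applied at the output layer

    def forward(self, s):                                 # s: (batch, state_len)
        x = self.conv(s.unsqueeze(1)).flatten(1)
        x = torch.relu(self.fc(x))
        return self.max_torque * torch.tanh(self.out(x))  # bounded motor torques a_t

class QNet(nn.Module):
    def __init__(self, state_len: int, action_dim: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(32 * state_len + action_dim, 64)  # extra input units accommodate a_t
        self.out = nn.Linear(64, 1)                            # single evaluation value

    def forward(self, s, a):
        x = self.conv(s.unsqueeze(1)).flatten(1)
        x = torch.relu(self.fc(torch.cat([x, a], dim=1)))
        return self.out(x)
```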
The parameters of the online networks are randomly initialised and copied to the corresponding offline networks. The maximum number of experiences E that the memory pool can store is set, as are the training-set size minibatch, the number of training iterations T per network training, the learning rate l_policy of the online policy network, the learning rate l_Q of the online Q network, the soft-update rate τ and the maximum number of steps W of one interaction walk. When the robot falls or completes W steps, one complete interaction, counted by epi, is considered finished; the maximum number of interactions is EPI. Finally, the robot state is randomly initialised.
S402: at the moment the swing leg starts to swing, the robot state is s_t and the online policy network outputs a group of motor torques a_t according to the current network, which can be expressed as:
a_t = μ(s_t | θ^μ)
wherein each row vector of a_t contains the torques executed by the stance-leg motor and the swing-leg motor at the hip joint, and the number of columns of a_t is consistent with s_t.
S403: during the swing of the swing leg, the two motors at the robot's hip joint execute their respective torques; each row vector of a_t is executed for a duration equal to the sampling interval. The motors first execute the first row of a_t, switch to the next row once the robot state θ_r has been sampled, and proceed in this order. In this embodiment the control torques are square-wave torques, which effectively avoids jitter during control. When the swing leg collides with the ground, the step counter w is updated and all sampled θ_r are fed into DAE_R to obtain the new robot state s_{t+1}.
S404: the design of the reward function is a particularly important step in deep reinforcement learning, and a good reward design can markedly improve learning. This embodiment uses a reward design that guides the training toward the goal faster; r_t is as follows:
When the robot does not fall, r_t is larger the smaller the gap between s_{t+1} and H_S, and is always greater than 0. When the robot falls, r_t = −1. In this way the robot is guided to approach the target gait under the premise of not falling.
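The following sketch reproduces only the qualitative behaviour stated above: a positive reward that grows as the new state approaches the target gait feature, and −1 on a fall. The exponential form and the Euclidean distance are assumptions; the patent does not give the exact formula.

```python
# Illustrative reward consistent with S404 (functional form assumed).
import numpy as np

def reward(s_next: np.ndarray, h_s: np.ndarray, fallen: bool) -> float:
    if fallen:
        return -1.0
    gap = np.linalg.norm(s_next - h_s)     # distance to the target gait feature H_S
    return float(np.exp(-gap))             # always > 0, larger as the gap shrinks
```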
S405: store (s_t, a_t, r_t, s_{t+1}) in the memory pool as one experience tuple and update the experience counter exp. Depending on the situation, the counters trigger the following operations: 1) if the robot has fallen, reset the robot state, return to S402 and reset w; 2) if it has not fallen and w < W, take s_{t+1} as the new s_t and execute S402; 3) if w ≥ W and exp > E, execute S406 in sequence and reset w; 4) otherwise reset the robot state, return to S402 and reset w. epi is updated when S401, S403 and S404 are executed.
S406: randomly draw minibatch experience tuples from the memory pool as the training data set of the online networks.
S407: take the s_t and a_t in the training data and feed them into the online Q network for evaluation, obtaining Q(s_t, a_t | θ^Q). Feed the s_{t+1} in the data set into the offline policy network to obtain the motor torques a′_{t+1}, and evaluate s_{t+1} and a′_{t+1} with the offline Q network, obtaining Q′(s_{t+1}, μ′(s_{t+1} | θ^{μ′}) | θ^{Q′}). The loss function of the online Q network can then be expressed as the mean squared error L_Q = (1/N)·Σᵢ (yᵢ − Q(s_t, a_t | θ^Q))²,
wherein yᵢ = r_t + γ·Q′(s_{t+1}, μ′(s_{t+1} | θ^{μ′}) | θ^{Q′}). The online Q network is updated by stochastic gradient descent according to L_Q.
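A compact sketch of this critic update is given below in PyTorch. The optimiser, the discount factor γ = 0.99 and the batch layout (tensors with r_t shaped (batch, 1)) are assumptions for illustration.

```python
# Sketch of the online Q network update in S407.
import torch
import torch.nn.functional as F

def update_critic(q, q_offline, mu_offline, q_optim, batch, gamma=0.99):
    s_t, a_t, r_t, s_next = batch                            # stacked tensors from the sampled minibatch
    with torch.no_grad():
        a_next = mu_offline(s_next)                          # μ'(s_{t+1} | θ^{μ'})
        y = r_t + gamma * q_offline(s_next, a_next)          # y_i = r_t + γ Q'(s_{t+1}, ·)
    loss_q = F.mse_loss(q(s_t, a_t), y)                      # L_Q
    q_optim.zero_grad()
    loss_q.backward()
    q_optim.step()
    return loss_q.item()
```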
S408: compute the policy gradient of the policy network. The loss function of the online policy network is defined as:
L_μ = Q(s_t, μ(s_t | θ^μ) | θ^Q)
From this loss function the gradient of the online policy network with respect to θ^μ can be computed by the chain rule, ∇_{θ^μ}L_μ = ∇_a Q(s_t, a | θ^Q)|_{a=μ(s_t)}·∇_{θ^μ}μ(s_t | θ^μ).
The online policy network parameters are likewise updated by stochastic gradient descent.
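An illustrative actor update corresponding to S408 is sketched below. Minimising the negated Q value is the usual way to ascend L_μ with a gradient-descent optimiser; the function and variable names are not from the patent.

```python
# Sketch of the online policy network update in S408.
def update_actor(mu, q, mu_optim, s_t):
    loss_mu = -q(s_t, mu(s_t)).mean()    # maximise L_μ = Q(s_t, μ(s_t | θ^μ) | θ^Q)
    mu_optim.zero_grad()
    loss_mu.backward()
    mu_optim.step()
    return loss_mu.item()
```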
S409: after the parameters of the online policy network and the online Q network have been updated, update the offline policy network and the offline Q network by the soft update.
S410: update the training counter times; when times exceeds the single training count T, this round of network training ends and S411 is executed; otherwise return to S406 and continue network training.
S411: when epi > EPI, the DDPG computation ends and the online policy network is saved as the controller. If epi < EPI, reset the robot state and return to S402.
During the robot's walking, DDPG is used continuously for learning and training until the online policy network μ and the online Q network Q converge or the maximum number of interactions EPI is reached. Once the networks in DDPG have converged, the online policy network can control a robot starting from a random initial gait until it reaches the target gait. Likewise, if the robot is disturbed by an external force while walking, the first step after the disturbance is treated as an initial gait and is controlled effectively with DDPG; as long as the robot is not in a fallen state, DDPG can exert effective control. The use of the target gait feature data H_S provides the basis for computing r_t and enables r_t to describe the stability and the compliance of the robot gait at the same time. By combining deep reinforcement learning with human gait data, this embodiment finally gives the robot a gait as stable and compliant as a human's.
The specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to the above specific embodiments; those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substance of the invention. In the absence of conflict, the embodiments of the present application and the features in the embodiments can be combined with one another arbitrarily.

Claims (6)

1. A biped robot gait planning method based on deep reinforcement learning, characterised by comprising the following steps:
Step S1: establish a biped robot model and describe the robot walking process;
Step S2: acquire and process human gait data and target gait data;
Step S3: use denoising autoencoders to extract the hidden features of the biped robot gait data and the human gait data respectively;
Step S4: learn the human gait features with deep reinforcement learning, and thereby plan the gait of the biped robot.
2. The biped robot gait planning method based on deep reinforcement learning according to claim 1, characterised in that step S1 specifically includes the following steps:
Step S101: establish a four-link kneed biped robot model with arc feet; wherein the model comprises two thighs, two shanks and two arc feet, the legs are rigid rods connected by frictionless hinges, the arc feet are fixed to the shanks, the stance leg and the swing leg have identical mass and geometric parameters, the mass of each leg is uniformly distributed, a limit mechanism is placed at each knee joint of the model to imitate the function of the human knee, and two motors are placed at the hip joint to apply control torques to the stance leg and the swing leg respectively;
Step S102: take the right side of the direction of advance during walking as the viewpoint, analyse the gait process of the model, select dimensionless physical quantities that characterise the robot state in real time, and define the selected quantities as the robot walking state Θ_r, which consists of the following angles and angular velocities:
wherein counter-clockwise rotation is taken as positive, θ_r1 and θ̇_r1 are the angle and angular velocity of the swing-leg shank with respect to the vertical, θ_r2 and θ̇_r2 are the angle and angular velocity of the swing-leg thigh with respect to the vertical, and θ_r3 and θ̇_r3 are the angle and angular velocity of the stance-leg shank with respect to the vertical.
3. The biped robot gait planning method based on deep reinforcement learning according to claim 2, characterised in that step S2 specifically includes the following steps:
Step S201: define one gait cycle, for both the human and the robot, as the process from the start of the swing of the swing leg until the swing leg collides with the ground;
Step S202: choose a data set of normal human walking from the CMU human motion capture database, segment it by body parts and resolve it, obtaining a description of the human walking process;
Step S203: taking the robot model as reference, take the 2D sagittal plane of human walking and define the human walking state as Θ_m; express all data in the human walking process description as Θ_m, and combine the Θ_m row vectors to obtain the human gait data Θ_M;
Step S204: from the human gait data Θ_M, choose one gait cycle as the robot's learning object, extract the odd-numbered frames of the learning-object data to form a new data set, and define it as the target gait data Θ_S, wherein any row vector of Θ_S is an extracted Θ_m;
Step S205: sample the robot walking state Θ_r within one gait cycle at the sampling frequency of Θ_S to form the robot gait data Θ_R, wherein any row vector of Θ_R is a sampled Θ_r.
4. The biped robot gait planning method based on deep reinforcement learning according to claim 3, characterised in that step S3 specifically includes: according to the data structure of Θ_r and Θ_m, constructing two denoising autoencoders of identical structure and performing feature extraction on the robot gait data Θ_R and the target gait data Θ_S; feeding the row vectors of Θ_R and Θ_S into the denoising autoencoders one by one, arranging the obtained features in their original order to form the robot gait feature data H_R and the target gait feature data H_S, and normalising H_R and H_S uniformly for use in the deep reinforcement learning; wherein each denoising autoencoder operates through the following steps:
S301: take one row vector Θ of Θ_R or Θ_S and feed it into the denoising autoencoder; the denoising autoencoder uses a binomial distribution to selectively erase elements of the original gait data Θ, setting the erased entries to 0 and obtaining the corrupted gait data Θ̃; the encoding function f maps Θ̃ to the hidden layer, giving the hidden feature h = s_f(wΘ̃ + p),
wherein w is the weight matrix between the input layer and the hidden layer and s_f is the activation function of the encoding function f, taken as the sigmoid function;
S302: the decoding function g maps the hidden feature h to the output layer, giving the reconstruction output y = s_g(w̃h + q); the reconstruction output y retains the information of the original gait data, and the overall error is expressed by the overall loss function J_DAE of the denoising autoencoder on the given training set, J_DAE(θ_DAE) = Σ L(Θ, y),
wherein w̃ is the weight matrix between the hidden layer and the output layer and satisfies w̃ = wᵀ, and s_g is the activation function of the decoding function, likewise the sigmoid function;
wherein θ_DAE denotes the parameters of the denoising autoencoder, namely w, p and q, and L is the reconstruction error, which measures how close y is to Θ
over the n dimensions of the input and output layers;
S303: during training, the denoising autoencoder uses gradient descent to iteratively minimise J_DAE(θ), the update of θ_DAE being θ_DAE ← θ_DAE − α·∂J_DAE/∂θ_DAE,
wherein α is the learning rate, with a value in [0, 1].
5. The biped robot gait planning method based on deep reinforcement learning according to claim 3, characterised in that in step S4 the deep deterministic policy gradient algorithm DDPG is selected as the learning algorithm of the biped robot; the robot gait feature data H_R processed by the denoising autoencoder is used as the input data s_t of the deep deterministic policy gradient algorithm, the target gait feature data H_S is used as the basis for computing the reward r_t, and the deep deterministic policy gradient algorithm outputs the torques a_t executed by the motors; the robot collects data of different gaits during continuous walking and supplies them to the training of the deep deterministic policy gradient algorithm, which finally acquires the ability to control the robot to reach the target gait.
6. The biped robot gait planning method based on deep reinforcement learning according to claim 5, characterised in that the policy network of the deep deterministic policy gradient algorithm is a five-layer convolutional neural network comprising an input layer, two convolutional layers, a fully connected layer and an output layer, wherein the input layer receives s_t and the output layer outputs the torques a_t to be executed by the motors.
CN201810979187.2A 2018-08-27 2018-08-27 Biped robot gait planning method based on deep reinforcement learning Active CN108983804B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810979187.2A CN108983804B (en) 2018-08-27 2018-08-27 Biped robot gait planning method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810979187.2A CN108983804B (en) 2018-08-27 2018-08-27 Biped robot gait planning method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN108983804A true CN108983804A (en) 2018-12-11
CN108983804B CN108983804B (en) 2020-05-22

Family

ID=64547820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810979187.2A Active CN108983804B (en) 2018-08-27 2018-08-27 Biped robot gait planning method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN108983804B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751037A (en) * 2008-12-03 2010-06-23 上海电气集团股份有限公司 Dynamic walking control method for biped walking robot
CN104751172A (en) * 2015-03-12 2015-07-01 西安电子科技大学 Method for classifying polarized SAR (Synthetic Aperture Radar) images based on de-noising automatic coding
CN106127804A (en) * 2016-06-17 2016-11-16 淮阴工学院 The method for tracking target of RGB D data cross-module formula feature learning based on sparse depth denoising own coding device
CN106406162A (en) * 2016-08-12 2017-02-15 广东技术师范学院 Alternating current servo control system based on transfer neural network
US20180268262A1 (en) * 2017-03-15 2018-09-20 Fuji Xerox Co., Ltd. Information processing device and non-transitory computer readable medium
CN107506333A (en) * 2017-08-11 2017-12-22 深圳市唯特视科技有限公司 A kind of visual token algorithm based on ego-motion estimation
CN108241375A (en) * 2018-02-05 2018-07-03 景德镇陶瓷大学 A kind of application process of self-adaptive genetic operator in mobile robot path planning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KAI HENNING KOCH et al.: "Optimization-based walking generation for humanoid robot", 10th IFAC Symposium on Robot Control *
吴晓光 et al.: "A neural-network-based biped robot", China Mechanical Engineering *
胡运富 et al.: "Simulation and analysis of a simple biped passive walking model", Journal of Harbin Institute of Technology *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109760046A (en) * 2018-12-27 2019-05-17 西北工业大学 Space robot motion planning method for capturing a tumbling target based on reinforcement learning
CN110046457A (en) * 2019-04-26 2019-07-23 百度在线网络技术(北京)有限公司 Control method, device, electronic equipment and the storage medium of manikin
CN110046457B (en) * 2019-04-26 2021-02-05 百度在线网络技术(北京)有限公司 Human body model control method and device, electronic equipment and storage medium
CN112149835A (en) * 2019-06-28 2020-12-29 杭州海康威视数字技术股份有限公司 Network reconstruction method and device
CN112149835B (en) * 2019-06-28 2024-03-05 杭州海康威视数字技术股份有限公司 Network reconstruction method and device
CN110496377B (en) * 2019-08-19 2020-07-28 华南理工大学 Virtual table tennis player ball hitting training method based on reinforcement learning
CN110496377A (en) * 2019-08-19 2019-11-26 华南理工大学 A kind of virtual table tennis forehand hit training method based on intensified learning
CN110764415A (en) * 2019-10-31 2020-02-07 清华大学深圳国际研究生院 Gait planning method for leg movement of quadruped robot
CN110764415B (en) * 2019-10-31 2022-04-15 清华大学深圳国际研究生院 Gait planning method for leg movement of quadruped robot
CN112782973A (en) * 2019-11-07 2021-05-11 四川省桑瑞光辉标识***股份有限公司 Biped robot walking control method and system based on double-agent cooperative game
CN110711055A (en) * 2019-11-07 2020-01-21 江苏科技大学 Image sensor intelligence artificial limb leg system based on degree of depth learning
CN110861084A (en) * 2019-11-18 2020-03-06 东南大学 Four-legged robot falling self-resetting control method based on deep reinforcement learning
CN110861084B (en) * 2019-11-18 2022-04-05 东南大学 Four-legged robot falling self-resetting control method based on deep reinforcement learning
CN111625002A (en) * 2019-12-24 2020-09-04 杭州电子科技大学 Stair-climbing gait planning and control method of humanoid robot
CN111625002B (en) * 2019-12-24 2022-12-13 杭州电子科技大学 Stair-climbing gait planning and control method of humanoid robot
CN111142378A (en) * 2020-01-07 2020-05-12 四川省桑瑞光辉标识***股份有限公司 Neural network optimization method of biped robot neural network controller
CN111241700B (en) * 2020-01-19 2022-12-30 中国科学院光电技术研究所 Intelligent design method of microwave broadband super-surface absorber
CN111241700A (en) * 2020-01-19 2020-06-05 中国科学院光电技术研究所 Intelligent design method of microwave broadband super-surface absorber
CN111558937A (en) * 2020-04-07 2020-08-21 向仲宇 Robot motion control method based on deep learning
CN111487864A (en) * 2020-05-14 2020-08-04 山东师范大学 Robot path navigation method and system based on deep reinforcement learning
CN111814618A (en) * 2020-06-28 2020-10-23 浙江大华技术股份有限公司 Pedestrian re-identification method, gait identification network training method and related device
CN111814618B (en) * 2020-06-28 2023-09-01 浙江大华技术股份有限公司 Pedestrian re-recognition method, gait recognition network training method and related devices
CN112060075A (en) * 2020-07-21 2020-12-11 深圳先进技术研究院 Training method, training device and storage medium for gait generation network
CN112171660A (en) * 2020-08-18 2021-01-05 南京航空航天大学 Space double-arm system constrained motion planning method based on deep reinforcement learning
CN112256028A (en) * 2020-10-15 2021-01-22 华中科技大学 Method, system, equipment and medium for controlling compliant gait of biped robot
CN112232350A (en) * 2020-10-27 2021-01-15 广东技术师范大学 Paddy field robot mechanical leg length adjusting method and system based on reinforcement learning
CN112232350B (en) * 2020-10-27 2022-04-19 广东技术师范大学 Paddy field robot mechanical leg length adjusting method and system based on reinforcement learning
CN112666939A (en) * 2020-12-09 2021-04-16 深圳先进技术研究院 Robot path planning algorithm based on deep reinforcement learning
CN112666939B (en) * 2020-12-09 2021-09-10 深圳先进技术研究院 Robot path planning algorithm based on deep reinforcement learning
CN114684293A (en) * 2020-12-28 2022-07-01 成都启源西普科技有限公司 Robot walking simulation algorithm
CN114047697B (en) * 2021-11-05 2023-08-25 河南科技大学 Four-foot robot balance inverted pendulum control method based on deep reinforcement learning
CN114047697A (en) * 2021-11-05 2022-02-15 河南科技大学 Four-footed robot balance inverted pendulum control method based on deep reinforcement learning
CN115366099A (en) * 2022-08-18 2022-11-22 江苏科技大学 Mechanical arm depth certainty strategy gradient training method based on forward kinematics
CN115366099B (en) * 2022-08-18 2024-05-28 江苏科技大学 Mechanical arm depth deterministic strategy gradient training method based on forward kinematics
CN117572877A (en) * 2024-01-16 2024-02-20 科大讯飞股份有限公司 Biped robot gait control method, biped robot gait control device, storage medium and equipment
CN117572877B (en) * 2024-01-16 2024-05-31 科大讯飞股份有限公司 Biped robot gait control method, biped robot gait control device, storage medium and equipment

Also Published As

Publication number Publication date
CN108983804B (en) 2020-05-22

Similar Documents

Publication Publication Date Title
CN108983804A (en) A biped robot gait planning method based on deep reinforcement learning
Bruderlin et al. Goal-directed, dynamic animation of human walking
Bongard et al. Evolving complete agents using artificial ontogeny
CN107861508A (en) A kind of mobile robot local motion method and device for planning
Lewis et al. Genetic algorithms for gait synthesis in a hexapod robot
EP2360629A2 (en) Device for the autonomous bootstrapping of useful information
CN109483530A (en) A legged robot motion control method and system based on deep reinforcement learning
Xia et al. Relmogen: Leveraging motion generation in reinforcement learning for mobile manipulation
CN113156892B (en) Four-footed robot simulated motion control method based on deep reinforcement learning
CN110400345A (en) Radioactive waste based on deeply study, which pushes away, grabs collaboration method for sorting
CN105965506A (en) Humanoid biped robot walking posture control method based on genetic algorithm
CN102157009A (en) Method for compiling three-dimensional human skeleton motion based on motion capture data
CN104921851B (en) The kneed forecast Control Algorithm of active above-knee prosthesis
CN110286592A (en) A kind of multi-modal movement technique of machine fish based on BP neural network and system
Fridman et al. Deeptraffic: Driving fast through dense traffic with deep reinforcement learning
Valsalam et al. Modular neuroevolution for multilegged locomotion
CN107481099A (en) Can 360 degree turn round real-time virtual fitting implementation method
CN106094817A (en) Reinforcement-learning humanoid robot gait planning method based on big data
CN110516736A (en) The visual multi-source heterogeneous data multilayer DRNN depth integration method of multidimensional
CN116977599A (en) Shield tunneling machine driving simulation method and system based on meta universe
CN113379027A (en) Method, system, storage medium and application for generating confrontation interactive simulation learning
Xu et al. Karting racing: A revisit to PPO and SAC algorithm
CN106910233B (en) Motion simulation method of virtual insect animation role
Conde et al. Learnable behavioural model for autonomous virtual agents: low-level learning
Revell et al. Sim2real: Issues in transferring autonomous driving model from simulation to real world

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant