CN109948054A

CN109948054A - A kind of adaptive learning path planning system based on intensified learning

Info

Publication number: CN109948054A
Application number: CN201910202413.0A
Authority: CN
Inventors: 刘丽萍; 吴文峻
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2019-03-11
Filing date: 2019-03-11
Publication date: 2019-06-28
Also published as: CN110569443B; CN110569443A

Abstract

The present invention relates to a kind of adaptive learning path planning system based on intensified learning, including environmental simulation, three modules of Strategies Training and path planning, the ability value at student's each moment that whole process is obtained according to the improved project principle of reflection, based on markov decision process, simulate complicated academic environment, it is online finally student's Adaptive Planning learning path according to trained strategy and rationally using the study track off-line training Path Planning of algorithm combination student's history of intensified learning.Present invention is finally based on the thoughts of intensified learning, the complex scene learnt on online education platform is implemented in the frame of markov decision process, target is promoted to efficient capacitation, the duration for providing education resource for student is recommended, optimal learning path is planned, to improve the learning effect and learning efficiency of learner.

Description

A kind of adaptive learning path planning system based on intensified learning

Technical field

The present invention relates to a kind of adaptive learning path planning system based on intensified learning, belongs to Computer Applied Technology Field.

Background technique

With becoming increasingly popular for online education, various E-learning resources, including e-book are can be used in student, are practiced after class It practises and learns video, in view of the background of student, the diversity and otherness of mode of learning and know-how, online education platform are needed Personalized education resource recommendation tool is introduced, to facilitate student to select the learning path of oneself, it is personalized to meet them Learning demand.

Existing individualized learning resource recommendation algorithm, can be divided into two classes substantially, and rule-based recommendation and data are driven Dynamic recommendation, most of intelligence instruction systems (Intelligent Tutoring System, ITS), mostly uses greatly rule-based Method carry out the recommendation of education resource, this needs domain expert just to assess the study scene of different type student, and fixed Adopted corresponding extensive recommendation rules.It will be apparent that and this labor-intensive method can be only applied to specific learning areas, can Scalability is not strong.For modern extensive on-line education system, designer generallys use the recommended method of data-driven, such as assists Study proposed algorithm is realized with filter method.The proposed algorithm of these data-drivens attempts by comparing student and learning object Between similitude to recommend suitable education resource for student.

Although the recommended method of data-driven has more scalability and versatility than rule-based method, at present Some solutions realize to student carry out adaptive learning resource recommendation in terms of all there is a problem that it is identical, i.e., often only It can be according to the perhaps learning behavior of student in education resource, to retrieve the education resource or similar learning behavior of Similar content Student group in, not in view of education resource difficulty and student's learning state dynamic changes influence.

Based on it is presently recommended that algorithm present Research, traditional proposed algorithm such as collaborative filtering, hidden semantic model etc., mainly It is intended for commercial product recommending or the distribution from media content, main target is to guess the hobby of user, recommends its interested for user Commodity or content all more lay particular emphasis on the calculating of similitude whether in terms of customer-side or content；And it is provided towards study The recommendation in source, what is more valued is that education resource can be for the capability improving of student's bring, this is not simple in conventional recommendation algorithm What the calculating based on similitude can be accomplished, and the promotion of student ability be it is procedural, it is not stranghtforward, among these Just it has been related to the planning of learning path.Thus the present invention proposes a kind of adaptive learning path planning side based on intensified learning Method effectively solves the above problems, and makes the strategy of the maximum most fast capability improving of student's acquisition.

Summary of the invention

Technology of the invention solves the problems, such as: overcoming the deficiencies of the prior art and provide a kind of based on the adaptive of intensified learning The complex scene learnt on online education platform is implemented in Ma Er by learning path planning system, the thought based on intensified learning In the frame of section's husband's decision process, target is promoted to efficient capacitation, the duration for providing education resource for student is recommended, It plans optimal learning path, can be improved the learning effect and learning efficiency of learner.

A kind of technical solution of the invention: adaptive learning path planning system based on intensified learning, comprising: packet Include environmental simulation module, Strategies Training module and path planning module.

Environmental simulation module, realize complicated online learning environment is converted to machine it will be appreciated that language and text Word；Based on student in the essential information of history learning record and education resource on on-line study platform, according to improved The project principle of reflection, formalization obtain the five-tuple of markov decision process；

Strategies Training module realizes function of the off-line training based on the Path Planning under each capability state；According to The five-tuple for the markov decision process that environmental simulation module obtains, using the Q_learning algorithm based on intensified learning, Off-line training is obtained based on the Path Planning under each capability state；

Path planning module realizes as the function of target student's real-time perfoming path planning；According to Strategies Training module Obtained strategy obtains the Optimal Learning path planned in real time for target student based on the current capability state of target student. It is finally reached the target for improving learning effect and efficiency.

The environmental simulation module step is as follows: environmental simulation is needed based on markov decision process, by it is complicated Line study scene form turns to the five-tuple<S, A, T of markov decision process, and R, γ>；

(11) S indicates that state will be learned according to the ability value at student's each moment that the improved project principle of reflection obtains Raw ability value divides ability value area in student's quantity normal distribution ratio as state S, by every one-dimensional student ability value Between, five sections are divided according to the distribution proportion of student's quantity 1:2:5:2:1, each section takes energy of the mean value as the section Force value；

(12) A expression acts, and refers to the behavior set that intelligent body can be taken, and in the environment of online education, as student can With the resource collection of study；

(13) T indicates state transition probability, and the student after being divided based on the state demarcation in (11), and a large amount of abilities is learned Habit behavior path data, statistics calculate state transition probability T；

T (s, a, s')=P (s_t=s'| s_t=s, a_t=a)

(14) R indicates that award, award are divided into award immediately and accumulation award；

Immediately award is applied to the learning process of student, it will be appreciated that shifts after the state s moment has learnt resource a for student To state s ', instant reward value r (s, a, s ') can be obtained, the reward value is related with following three factor:

P (T): being correctly completed probability, and student can be correctly completed the probability of education resource a under the moment ability value, It is predicted based on learning effect assessment models.

F (T): correctly shifting the frequency, all in student path that the sample that a is transferred to state s ' is displaced through from state s, The probability that transfer is wherein completed and being correctly completed education resource, may be expressed as:

·Diff(s₁,s₂)=(s '-s) difficulty_a, before the maximal increment of conversion front and back ability is expressed as ability The dot product of difference value vector and education resource difficulty afterwards, it is therefore an objective to match the ability value of student and the difficulty of education resource, and By vector scalarization, convenient for award calculating compared with.

Award may be expressed as: immediately as a result,

R (s, a, s')=ω × Diff (s, s')

ω=P (T) × F (T)+(1-P (T)) × (1-F (T))

Wherein, coefficient of the ω as maximum capacity increment, it is therefore an objective to according to student ability and known sample distribution, difference Changing big maximum capacity increment, student can be from the growth for being correctly completed capacitation in education resource, and vice versa is trained, than If student answers after certain wrong problem according to feedback sense to the knowledge point wherein contained, for student also it is a kind of at It is long.Such representation also maintains the consistency of P (T) Yu F (T).

Accumulation award

Accumulation award (Return, G), is also referred to as returned, and is defined as awarding a certain specific function of sequence, if t Award sequence after step is R_t+1,R_t+2,R_t+3,…R_T, T is total step-length, then returns G and can be expressed simply as each step and encourage immediately The sum of reward:

But since the path length of student is not quite similar, if only to find cumulative maximum award for target, with student road The growth of electrical path length, G value also can be increasing, and this and do not meet and recommend optimal and shortest path target herein for student, Therefore discount factor should be added herein, to weaken the influence of future returns.

(15) γ indicates discount factor, and in the expression of above-mentioned calculating accumulation award, γ ∈ [0,1] will be equivalent to future Return make a discount, if γ levels off to 0, be only concerned about current instant award, often executing makes current award immediately most Big behavior, essence are a kind of greedy behaviors；It, can more consideration is given to future returns if γ levels off to 1.

The Strategies Training module step is as follows:

(21) five-tuple<S of the obtained markov decision process in storage environment simulation steps, A, T, R, γ>；

(22) an initial capability state S is randomly choosed from competence set S₁；

(23) based on ε-greedy strategy in S₁Resource A has been selected under capability state₁Learnt, has learnt A₁Later, root According to the next capability state S of environment Observable₂, while having obtained instant award R₂(consummatory behavior strategy), selects current energy at this time Maximum Q value is to update Q function (completing target strategy) under power state:

Q_k+1(S₁, A₁)=(1- α) Q_k(S_1,A₁)rα[R₂+ymax_aQ_k(S₂, A_z)]

(24) (23) constantly are recycled, until learning ability reaches requirement, i.e., arrival final state, circulation (22) are selected again Select initial capability state；

(25) optimal policy under each capability state is stored in the form of dictionary.

Further, specific step is as follows for ε-greedy strategy:

(1) ε ∈ [0,1] value, and the random number between a random 0-1 are specified；

(2) selectable resource under current ability state is randomly choosed if random number is less than ε learnt (each resource The probability selected isWherein | A1 | for selectable resource number under current state)；

(3) select that there is maximum rating-action value Q resource under current state if random number is more than or equal to ε It practises；

Steps are as follows for the path planning module:

(31) the current ability state s of target student is obtained；

(32) in the strategy of step (25) storage, the learning path l with s in the state of is found；

(33) path l is recommended into target student, and adaptive revised planning study in learning process continuous behind Path.

Further, adaptive revised planning path step is as follows:

(1) previous step (31,32) can plan learning path l, next according to the current ability s of target student for it After the habit stage, the capability state of target student is changed to s '；

(2) step (32) are repeated, according to the updated capability state s ' of target student, new recommendation paths l ' is planned for it Compare the subsequent path and l ' of l, if it is different, l ' replacement l is then used, it is then constant if they are the same.

The advantages of the present invention over the prior art are that: existing education resource recommended technology is broadly divided into rule-based Recommendation and data-driven education resource recommended technology, rule-based method carries out the recommendation of education resource, needs to lead Domain expert assesses the study scene of different type student, and defines corresponding extensive recommendation rules.It is a kind of labor-intensive Method, can be only applied to specific learning areas, scalability is not strong, and the present invention is based on intensified learning technology, using from Cost of labor is greatly saved compared with rule-based recommended method in dynamicization planning learning path；For modern extensive On-line education system, designer generally use the recommended method of data-driven, and the proposed algorithm of these data-drivens passes through mostly Compare the similitude between student and learning object to recommend suitable education resource for student, causes to exist in learning path big The education resource of similarity redundancy is measured, not in view of the efficiency that student ability is promoted, the present invention is with the history learning rail of a large amount of students Mark is sample, extracts the capability state of the student of student, using end-state as target training Generalization bounds, is realized most fast maximum The ability of ground promotion student；The present invention in such a way that online recommendation paths combine, solves recommendation using off-line training strategy Response speed problem, to realize Adaptive Planning learning path.

Detailed description of the invention

Fig. 1 is the system construction drawing of learning path planing method；

Fig. 2 is the flow diagram of environmental simulation；

Fig. 3 is the flow diagram of Strategies Training；

Fig. 4 is the schematic diagram of learning path reasonable evaluation；

Fig. 5 is the recommendation paths and non-recommended path average length comparison diagram of this technology and the prior art；

Fig. 6 is the schematic diagram of learning path efficiency assessment；

Fig. 7 is this technology route matching degree and capacity gain datagram.

Specific embodiment

The adaptive learning paths planning method proposed by the present invention based on intensified learning is explained in detail with reference to the accompanying drawing.

Adaptive learning paths planning method proposed by the present invention based on intensified learning, overall system architecture such as Fig. 1, base In the historical data of student and education resource, the user basic information of teacher and student, the content-data (class of different education resources Journey video, after class system, zone of discussion etc.) and student and education resource interbehavior data, initial data is stored regular It is transmitted in HDFS and saves for a long time, since learning path planning system can also generate student and education resource in the process of running Interbehavior data are equally also required to regularly update this batch data.Based on the partial data, environment mould is successively carried out It is quasi-, Strategies Training and path planning step, based on the study scene of markov decision process frame simulation student, extract and from Dispersion student each study stage Efficiency analysis as state, statistic behavior transfer is general from the learning behavior data of history Rate, and associative learning recruitment evaluation module trains the build-in attribute of obtained education resource, training generates during intensified learning Complicated on-line study scene form, is thus turned to the Ma Er of mathematics level by the instant award of intelligent body and environmental interaction feedback Section husband decision process frame, using nitrification enhancement, the optimal learning strategy of trial and error training, above section are counted due to it repeatedly The considerations of evaluation time cost is periodically offline update, is finally based on trained learning strategy, the current energy according to target student Power state plans optimal learning path for it, and to be enable to respond quickly recommender system, which is student Quickly and consistently carry out the recommendation of education resource and the planning of learning path, then by target student it is newly generated with study provide The interaction data in source is stored in database.

The present invention is based on intensified learning, it is pair that markov decision process, which is that complete observable environment is described, The abstract and idealization for a kind of mathematics level that intensified learning problem carries out, it enables complex environment to transform into machine The language and text that device understands, in order to be able to the challenge under actual environment be solved using the algorithm of intensified learning It answers.It thus needs to carry out each key element in markov decision process formal definitions mathematically, according to student Learning behavior data, simulation steps flow diagram is carried out to environment of the student in learning process as shown in Fig. 2, study is imitated The ability value at student's each moment that the training of fruit assessment models obtains is made as input according to normal distribution discretization ability value For state S；Based on the state divided, and a large amount of learning behavior data, statistics calculate state transition probability T；According to meter Formula is calculated, award R immediately can be calculated；Based on instant award, strategy is obtained using nitrification enhancement training, i.e., each state The optimal movement that can be taken down can be used for doing for target student and recommend, and the current capability state of input target student is planned for it Optimal learning path.Based on above-mentioned process, academic environment form complicated in online education can be turned to Markov and determined Plan process, is represented by a five-tuple<S, A, T, and R, γ>.

Strategies Training step involved in the present invention, process signal are as shown in Figure 3, the specific steps are as follows:

(1) five-tuple<S of the obtained markov decision process in storage environment simulation steps, A, T, R, γ>；

(2) an initial capability state S is randomly choosed from competence set S₁；

(3) based on ε-greedy strategy in S₁Resource A has been selected under capability state₁Learnt, has learnt A₁Later, root According to the next capability state S of environment Observable₂, while having obtained instant award R₂(consummatory behavior strategy), selects current energy at this time Maximum Q value is to update Q function (completing target strategy) under power state:

Q_k+1(S_1,A₁)=(1- α) Q_k(S_I,A₁)ra[R₂+γmax_aQ_k(S₂, A_z)]

(4) (23) constantly are recycled, until learning ability reaches requirement, i.e., arrival final state, circulation (22) reselect Initial capability state；

(5) optimal policy under each capability state is stored in the form of dictionary.

Adaptive learning paths planning method proposed by the present invention based on intensified learning, the ability current from target student State is set out, and optimal learning path is planned for it, so that student ability is obtained the promotion of highest effect, for the study of recommendation Path, the present invention compare the prior art, have carried out experimental evaluation for the learning path of recommendation, experimental section is classified into two sides Face, the validity experiment of recommendation paths and the reasonability experiment of recommendation paths.

1. reasonability is tested

The reasonability experiment of recommendation paths is mainly used for verifying, and the education resource in recommendation paths is for target student Whether rationally, consider from the length in path, if the most fast promotion for obtaining capacitation of student, i.e., more identical threshold energy can be made The path of power and identical final ability, whether path length is shorter compared with Actual path for recommendation paths.As shown in figure 4, this Invention is that the student of each capability state recommends a paths, for each paths, from the original interaction data of a large amount of student In, the non-recommended path of initial ability identical as recommendation paths and identical final ability is picked out, the difference of path length is compared Different, for the difference of the length of student's recommendation paths and non-recommended path of more different ability levels, the present invention rises according to student The Efficiency analysis of beginning is clustered, and student is fallen into 5 types, and from I to V, integration capability from low to high, counts under each classification and owns The non-recommended learning path length of start-stop ability identical as recommendation paths, and more accordingly pushed away in following different proposed algorithm It recommends and the mean value size of non-recommended path length, wherein UCF and ICF is Collaborative Filtering Recommendation Algorithm, PI, VI, Sarsa and Q_ Learning is the learning path planning algorithm based on intensified learning.For experimental index, the present invention is intuitively using recommendation road The length L of diameter_recAnd the average length L in non-recommended path_{no_rec}。

L_rec=l_rec

1) UCF: the collaborative filtering based on user calculates the similitude of student ability, recommends and target student ability The learning path of similar student.

2) ICF: the collaborative filtering based on article calculates the similitude of study Resource Properties, search and target student The similar education resource of history learning resource, will there is the student of interbehavior with this education resource, and other education resources are recommended Give target student.

3) PI: the path planning algorithm based on Policy iteration, the nitrification enhancement based on Dynamic Programming.

4) VI: the path planning algorithm based on value iteration, the nitrification enhancement based on Dynamic Programming.

5) Sarsa: the path planning algorithm based on Sarsa, Timing Difference synchronization policy nitrification enhancement.

6) Q_learning: the path planning algorithm based on Q_learning, the asynchronous tactful intensified learning of Timing Difference are calculated Method, the Strategies Training method used for the present invention.

The result of reasonability experiment as shown in figure 5, compare under different initiation capacity states, proposed algorithm initiation capacity compared with Performance when low is preferable, and initiation capacity, in a higher state, effect and the non-recommended effect of recommendation are not much different, Show that higher student of ability value itself has had stronger learning ability, and selectable resource space is smaller.

Based on the proposed algorithm of intensified learning under identical initial ability level, the path length of recommendation is integrally shorter than The recommendation paths of UCF and ICF algorithm, reason are that the path planning algorithm based on collaborative filtering only accounts for student or study money The similitude in source, for target student recommend similar student path or similar education resource, learning in view of student The demand of capability improving in the process.Wherein ICF is more that student recommends similar education resource, consolidates knowledge repeatedly though having Effect, reduce the forgetting of knowledge point, can also bring the promotion of ability value, but the similar education resource of repetition learning causes to learn The redundancy in path is practised, so that learning efficiency reduces.In contrast, UCF brings relatively more reasonable recommendation in path length Performance, but since it searches for already present learning path in existing student, other learning paths are not explored, and it is similar Student might not have optimal learning path, target student can not be made to reach most so as to cause the learning path of recommendation The promotion of big ability, if recommendation paths length of the UCF in class ii is 12, but be only capable of reaching can for its final integration capability The 72% of power.

Comparing four kinds of learning path planning algorithms based on intensified learning can reach under identical initial ability Highest capability state.Wherein the algorithm PI based on Policy iteration and the algorithm VI recommendation effect based on value iteration are almost the same, It is to find optimum state value function, difference is based on state in Policy iteration since essence is consistent in an iterative process for it Value is continuously evaluated stragetic innovation strategy, and being worth iteration then is direct searching optimum state value function, calculates plan further according to state value Slightly, but since Policy iteration has carried out double-layer lap generation, iteration efficiency is far below value iteration.

Sarsa and Q_learning algorithm is compared with the nitrification enhancement based on Dynamic Programming, same original state energy Under power, the learning path length of recommendation is relatively much shorter, recommends performance more excellent especially in I class and class ii, reason is base In the learning algorithm that the nitrification enhancement of Timing Difference is model-free, the ambient condition without relying on sample data shifts general Rate, and the diversity of data is also enriched while study come self-learning environment by way of continuous trial and error.

It is equally Temporal-difference, Q_learning algorithm is compared with Sarsa algorithm, in lower initial ability state Under, the recommendation learning path of Q_learning is shorter, then shows under the initial ability state of ability similar, and the main distinction exists Synchronization policy is used when Sarsa is in more new environment and value function, using same policy update state and movement, with selection Movement updates value function, and Q_learning uses asynchronous strategy, independently selects current value function maximum when updating value function When action value, exploring and use aspects have obtained better balance, thus be easier to obtain global optimum path, and Sarsa Update mode then tend to safer local optimum path.

And the problem of thus bringing be Q_leanring convergence rate it is slower compared with Sarsa, but grind in view of of the invention Study carefully content, trained strategy can be used for the online real-time recommendation learning path of student with off-line training strategy, thus Q_ Learning is a better choice of the invention.

2. validity is tested

Recommend validity experiment, as shown in fig. 6, analyzing true learning path using the existing historical interaction data of student With the distribution of matching degree and the student capability improving under really study scene of recommendation paths, i.e. identical of analysis foundation It is raw, it after the education resource for completing identical quantity, is matched with recommendation paths more, if ability value improves more.

The present invention is that the student of each capability state recommends an optimal path, for each paths, is learned from a large amount of In raw original interaction data, the true learning path of initial ability identical as recommendation paths is picked out, and with the length of recommendation paths Degree truncation, the matching degree of comparative analysis Actual path and recommendation paths, and final ability value mentioning compared to threshold energy force value It rises, i.e., under more identical initiation capacity state and same path length, the matching degree and ability for analyzing it with recommendation paths are mentioned The distribution situation risen.

Matching degree Match indicates under identical initiation capacity state, the matching journey in non-recommended path after recommendation paths and truncation Degree:

Wherein, | | Path_rec∩Path_{no_rec}| | indicate the length of recommendation paths and the continuous public substring of non-recommended path longest Degree, | | Path_rec| | indicate the length of recommendation paths.

Fig. 7 is the path planning algorithm experimental data based on Q_learning, and row indicates under identical match degree, different first The corresponding capacity gain of beginning ability；Under the identical initial ability of column expression, the corresponding capacity gain of Different matching degree.Wherein '-' It indicates not finding the Actual path with recommendation paths exact matching in the history interbehavior data of student.It can be seen by data Out, under identical match degree, when initial ability is lower, capability improving is bigger, such as schemes.When matching degree is 40% or more When, under identical initiation capacity state, capacity gain increases with matching degree and is improved, as shown in fig. 7, Actual path and recommending Route matching degree is higher, is more conducive to the promotion of student ability, and the path for sufficiently demonstrating recommendation promotes student ability Validity.

And for I, under II class initial ability state, in practical interbehavior data, can not find and recommendation paths The true path of exact matching indicates the proposed algorithm based on Q_learning based on the new overall situation of existing Data Mining most Shortest path.

Recited above is only the adaptive learning paths planning method embodiment embodied the present invention is based on intensified learning.This Invention is not limited to above-described embodiment.Specification of the invention is not limit the scope of the claims for being illustrated.For Those skilled in the art, it is clear that can have many replacements, improvements and changes.It is all to use equivalent substitution or equivalent transformation shape At technical solution, be all fallen within the protection domain of application claims.

Claims

1. a kind of adaptive learning path planning system based on intensified learning characterized by comprising environmental simulation module, Strategies Training module and path planning module；

Environmental simulation module, realize complicated online learning environment is converted to machine it will be appreciated that language and text；It is based on Student is reflected in the essential information of history learning record and education resource on on-line study platform according to improved project Principle, formalization obtain the five-tuple of markov decision process；

Strategies Training module realizes function of the off-line training based on the Path Planning under each capability state；According to environment mould The five-tuple for the markov decision process that quasi- module obtains, it is offline to instruct using the Q_learning algorithm based on intensified learning It gets based on the Path Planning under each capability state；

Path planning module is embodied as the function of target student's real-time perfoming path planning；It is obtained according to Strategies Training module Strategy is obtained the Optimal Learning path planned in real time for target student, is finally reached based on the current capability state of target student To the target for improving learning effect and efficiency.

2. the adaptive learning path planning system according to claim 1 based on intensified learning, it is characterised in that: described Environmental simulation module step is accomplished by

(21) S indicates capability state set, obtains the ability value at student's each moment according to the improved project principle of reflection, will The ability value of student is defined as state, for the discrete type for guaranteeing state, needs to carry out ability division, by every one-dimensional student ability Value divides ability value section in student's quantity normal distribution ratio, carrys out demarcation interval according to student's quantity Gaussian Profile ratio, Each section takes ability value of the mean value as the section；

(22) A indicates set of actions, refers to the behavior set that intelligent body can be taken, and in the environment of online education, as student is learned The resource collection of habit；

(23) T indicates state transition probability, after state and ability division after being divided based on the ability in step (11) Raw learning behavior path data, statistics calculate state transition probability T；

T (s, a, s ')=P (s_t+1=s ' | s_t=s, a_t=a)

WhereinIndicate stateful example,Expression acts example, and t indicates moment, s_tIndicate the shape under t moment State, a_tIndicate the movement selected under t moment；

(24) R indicates that award, award are divided into award immediately and accumulation award

Immediately award is applied to the learning process of student, and being interpreted as student in sometime state is that s ∈ S has learnt resource a ∈ A After be transferred to state s ' ∈ S, can obtain the instant reward value r (s, a, s ') at the moment, indicate the award that R is obtained at the moment Example, the reward value and is correctly completed probability, correct to shift the frequency and three factors of ability increment are related；

Accumulation award (Return, G), is also referred to as returned, and is defined as awarding a certain specific function of sequence, it is assumed that when current Carve be t, then after t moment after award sequence be R_t+1, R_t+2, R_t+3... R_M, M is that total duration then returns G and is expressed as each moment Immediately then the sum awarded adds discount factor and obtains:

(25) γ indicates discount factor, and in the expression of above-mentioned calculating accumulation award, γ ∈ [0,1] is equivalent to and returns following Report is made a discount, if γ levels off to 0, is only concerned about current instant award, often execute make currently to award immediately it is maximum Behavior, essence are a kind of greedy behaviors；It, can more consideration is given to future returns if γ levels off to 1.

3. the adaptive learning path planning system according to claim 1 based on intensified learning, it is characterised in that: described Steps are as follows for Strategies Training:

(31) five-tuple < S, A, T, R, the γ > of the obtained markov decision process in storage environment simulation steps；

(32) an initial capability state S is randomly choosed from capability state set S₁；

(33) based on ε-greedy strategy in capability state S₁Lower selection resource A₁Learnt, further according under environment Observable one A capability state S₂, while obtaining awarding R immediately₂, at this time select current ability state under maximum Q value to update Q value table:

Q_k+1(S₁, A₁)=(1- α) Q_k(S₁, A₁)+α[R₂+γmax_aQ_k(S₂, A₂)],

Wherein Q_kIndicate current Q value table, Q_k+1Indicate that updated Q value table, α indicate update ratio, every time more by new value part New old value；

(34) continuous circulation step (33), until learning ability reaches requirement, i.e., arrival final state, circulation step (22) are heavy Newly select initial capability state；

(35) optimal path under each capability state is stored in the form of dictionary, so far Strategies Training is completed.

4. the adaptive learning path planning system according to claim 1 based on intensified learning, it is characterised in that: described Path planning module realizes that steps are as follows:

(41) the current ability state s ∈ S of target student is obtained；

(42) in strategy, a learning path 1 with ability s in the state of is found；

(43) learning path is recommended into target student, and the adaptive revised planning study in subsequent learning process Path.

5. the adaptive learning path planning system according to claim 4 based on intensified learning, it is characterised in that: described In step (43), adaptive revised planning path step is as follows:

It (51) is that the student plans learning path, after next study stage, target student according to the current ability s of target student Capability state be changed to s '；

(52) step (42) are repeated, according to the updated capability state s ' of target student, new recommendation paths is planned for the student l′

(53) subsequent path of a learning path l in comparison step (42) and new recommendation paths l ', if it is different, then with new The recommendation paths l ' replacement step (42) in learning path l, it is then constant if they are the same.

6. the adaptive learning path planning system according to claim 1 based on intensified learning, it is characterised in that: described In step (21), the discretization method of student ability state interval is distributed ratio according to the Gaussian Profile of student's quantity 1: 2: 5: 2: 1 Example divides five sections.

7. the adaptive learning path planning system according to claim 1 based on intensified learning, it is characterised in that: step (24) in, instant reward value is related with following three factor:

P (T): being correctly completed probability, and student can be correctly completed the probability of education resource a under the moment ability value, based on Practise recruitment evaluation model prediction；

F (T): correctly shifting the frequency, all in student path that the sample that a is transferred to state s ' is displaced through from state s, wherein leading to The probability for being correctly completed education resource and completing transfer is crossed, is indicated are as follows:

C indicates sample number

Diff(s₁, s₂)=(s '-s) difficulty_a, the maximal increment of ability is expressed as the difference before and after ability before and after conversion It is worth the dot product of vector and education resource difficulty, to match the ability value of student and the difficulty of education resource, and by vector scalarization, Convenient for award calculating compared with；

Immediately award r is indicated are as follows:

R (s, a, s ')=ω × Diff (s, s ')

ω=P (T) × F (T)+(1-P (T)) × (1-F (T))

Wherein, coefficient of the ω as maximum capacity increment.

8. the adaptive learning path planning system according to claim 1 based on intensified learning, it is characterised in that: described In step (33), specific step is as follows for ε-greedy strategy:

(71) ε ∈ [0,1] value, and the random number between a random 0-1 are specified；

(72) it randomly chooses selectable resource under current ability state if random number is less than ε to be learnt, each resource quilt The probability of selection isWherein | A1 | for selectable resource number under current state；

(73) select under current state there is maximum rating-action value Q resource to be learnt if random number is more than or equal to ε.