CN114330716A - University student employment prediction method based on CART decision tree - Google Patents

University student employment prediction method based on CART decision tree Download PDF

Info

Publication number
CN114330716A
Authority
CN
China
Prior art keywords
employment
data
decision tree
student
attributes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111608264.1A
Other languages
Chinese (zh)
Inventor
党向盈
鲍蓉
姜代红
徐玮玮
佟恒乐
王晓雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xuzhou University of Technology
Original Assignee
Xuzhou University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xuzhou University of Technology filed Critical Xuzhou University of Technology
Priority to CN202111608264.1A priority Critical patent/CN114330716A/en
Publication of CN114330716A publication Critical patent/CN114330716A/en
Pending legal-status Critical Current

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a CART decision tree-based college student employment prediction method, aiming to provide a method for predicting the employment situation of college students. Firstly, the data information of college students is preprocessed to form a standard basic attribute data set for data mining; then, the correlation between the basic student attributes in the data set and the employment prediction target attribute is determined by Pearson correlation analysis, and the basic attributes related to the employment prediction target attribute are taken as the feature vector for constructing the employment prediction model; finally, based on the training set, the Gini coefficients of the feature variables are calculated and the college student employment prediction model is constructed with the CART decision tree algorithm. The method can predict the employment situation of college students from their information data set, provide intelligent services for the employment management departments of colleges and universities, guide students to seek employment reasonably, and help improve the employment rate of college students.

Description

University student employment prediction method based on CART decision tree
Technical Field
The invention relates to the technical field of artificial intelligence and big data analysis, and in particular to a CART decision tree-based college student employment prediction model that predicts the employment situation of college students from historical college student employment data.
Background
The number of college graduates reached 8.3 million in 2019 and exceeded 8.4 million in 2020. Together with roughly 300,000 students returning from study abroad, graduates of previous years who have not yet found work, and people re-entering the job market, nearly ten million people competed for employment opportunities in 2020. The domestic employment pressure is enormous: the newly added labor force far exceeds the newly created jobs, and with the continued expansion of higher education, the growing share of highly educated talent has sharply intensified employment competition among contemporary college students. Secondly, the structural imbalance of employment is also serious. In terms of employment regions, most college students are willing to work in first- and second-tier cities but reluctant to go to third- and fourth-tier cities. Moreover, a student's performance at school, such as grades and whether the student served as a student cadre, influences his or her employment situation, and the employment situation also differs markedly across disciplines, with some majors faring noticeably better than others.
Decision trees are a supervised learning model in the field of artificial intelligence; the three main types are ID3, C4.5 and CART (Classification And Regression Tree), which are typical classification and prediction algorithms. A node in a decision tree represents a test on an attribute, each branch represents one of the possible values of that attribute, and a leaf node represents the prediction result of the rule path from the root node to that leaf. A decision tree may have a single output or multiple outputs. Decision tree models are often employed in data mining to accomplish predictive mining tasks. Building a decision tree involves several steps: first, a suitable decision tree algorithm must be selected according to the prepared data set and the mining goal, since different algorithms suit different mining tasks; then, for the constructed decision tree model, a test set of a certain size must be held out from the original data set and used to evaluate the accuracy of the model and to analyze whether it meets the mining purpose.
Existing college student employment prediction methods give little consideration to the influence of the relevant attributes in college student employment information. The present method uses the CART algorithm, a classical decision tree algorithm, to construct the employment prediction model. The prediction function is realized by first cleaning the collected data and performing correlation analysis and other preprocessing, and then constructing and training a model based on the CART decision tree; the constructed model can be used to predict the employment area, position, salary and so on of college students, and thus further supports recommendation of suitable positions to them.
Disclosure of Invention
In order to overcome the shortcomings of existing college student employment prediction techniques, the invention provides a CART decision tree-based college student employment prediction method: the attributes related to the employment situation are determined from the basic attributes in the college student information data, and a CART decision tree prediction model capable of predicting the employment situation of college students is constructed.
The technical scheme adopted by the invention is as follows: a college student employment prediction method based on a CART decision tree comprises the following steps:
S1: preprocessing the information data of college students;
collecting the raw data of college students, constructing a basic student attribute set, and standardizing each data item to form a standardized data set, wherein the basic college student attribute set is recorded as N = {n_1, n_2, …, n_c}, in which n_i is the i-th basic attribute and c is the number of basic attributes;
S2: determining the relevant attributes influencing the college student employment prediction target;
setting the college student employment prediction target attribute set as Y = {y_1, y_2, …, y_|Y|}, where |Y| is the number of values of the prediction target attribute and y_u is a prediction target attribute value;
calculating the Pearson correlation coefficient λ_{i,u} between element n_i of N and element y_u of Y as:

λ_{i,u} = cov(n_i, y_u) / (σ_{n_i} · σ_{y_u})

where cov(n_i, y_u) is the covariance of n_i and y_u, and σ_{n_i} and σ_{y_u} are the standard deviations of n_i and y_u respectively;
setting the Pearson correlation coefficient threshold as h: when λ_{i,u} is not less than h, n_i is defined as correlated with Y; otherwise n_i is defined as uncorrelated with Y; on this basis the basic student attributes correlated with Y are collected, and the relevant attributes influencing the employment prediction target Y are recorded as the feature vector X = {x_1, x_2, …, x_m}, where m is the number of feature variables and m ≤ c; each x_i takes K_i possible values, recorded as V(x_i) = {v_i^1, v_i^2, …, v_i^{K_i}};
S3: constructing a university student employment prediction model based on the CART decision tree;
setting the basic attribute data of college students as α groups, of which r groups form the training set S and the remaining α − r groups form the test set; the training set S is used to construct the employment prediction model, and the test set is used to verify the accuracy of the employment prediction model;
calculating the Gini coefficient Gini(S, x_i = v_i^k) in the training set S for each feature x_i and each of its values v_i^k; solving the Gini coefficients of the basic student attributes in the training set S in this way, setting the Gini coefficient threshold as l, and then constructing the college student employment decision tree, i.e. the employment prediction model, based on Gini(S, x_i = v_i^k).
Preferably, in step S3, 70% of the data is set as the training set, and 30% of the data is set as the test set.
Preferably, in step S3, the method for solving the Gini coefficients of the basic student attributes in the training set S is as follows:
when x_i takes the value v_i^k, the corresponding subset is recorded as S_i^{k1}; when x_i does not take the value v_i^k, the corresponding subset is recorded as S_i^{k2}; S can thereby be divided into the two parts S_i^{k1} and S_i^{k2}, whose numbers of training samples are |S_i^{k1}| and |S_i^{k2}| respectively;
in S, when x_i = v_i^k, the probability that Y takes the value y_u is p_u^{k1}; when x_i ≠ v_i^k, the probability that Y takes the value y_u is p_u^{k2}; then the Gini coefficient of S_i^{k1} can be expressed as:

Gini(S_i^{k1}) = 1 − Σ_{u=1}^{|Y|} (p_u^{k1})²

similarly, the Gini coefficient of S_i^{k2} can be expressed as:

Gini(S_i^{k2}) = 1 − Σ_{u=1}^{|Y|} (p_u^{k2})²

from Gini(S_i^{k1}) and Gini(S_i^{k2}) it can be seen that, for S, the Gini coefficient of taking V(x_i) = v_i^k, recorded as Gini(S, x_i = v_i^k), can be expressed as:

Gini(S, x_i = v_i^k) = (|S_i^{k1}| / |S|) · Gini(S_i^{k1}) + (|S_i^{k2}| / |S|) · Gini(S_i^{k2})
preferably, the step S3 is based on
Figure BDA00034314579000000326
The method for constructing the university student employment CART decision tree comprises the following steps:
inputting: s, X ═ X1,x2,…,xm},l,m;
And (3) outputting: a decision tree T;
step 1: computing
Figure BDA00034314579000000327
If it is not
Figure BDA00034314579000000328
T is a single node tree; otherwise, turning to Step 2;
step2 for
Figure BDA0003431457900000041
Solving for their minimum value, noting that the minimum value is
Figure BDA0003431457900000042
Get
Figure BDA0003431457900000043
Is a cutting point of a binary tree;
step3 according to x in SiWhether the value is equal to
Figure BDA0003431457900000044
Divide S into two subsets
Figure BDA0003431457900000045
And
Figure BDA0003431457900000046
and will be
Figure BDA0003431457900000047
And
Figure BDA0003431457900000048
distributing the child nodes into two child nodes, wherein if the child node has a damping coefficient less than l, the child node is a leaf node, if the two child nodes are both leaf nodes, returning to the decision tree T, otherwise, performing Step 4;
step4 for non-leaf nodes, respectively in order
Figure BDA0003431457900000049
And
Figure BDA00034314579000000410
order to
Figure BDA00034314579000000411
And recursively calling Step1 to Step4 to generate a binary decision tree T.
After construction, the effectiveness of the built prediction model needs to be evaluated. The test set contains α − r samples in total; the basic attributes of each student are used as the input of the CART decision tree prediction model, the outputs for the target attribute are collected and compared with the true target attribute values in the test set, and if b of them agree, the accuracy of the prediction model can be expressed as:

SR = b / (α − r)

If the accuracy exceeds a certain threshold, the constructed prediction model is valid.
The invention has the following beneficial effects. Different from the prior art, the invention fully considers the basic attributes in college student employment information that are related to the employment prediction target, and adopts the Pearson correlation coefficient method to determine the relevant attributes influencing college student employment. A suitable Gini coefficient calculation method can be designed according to the characteristics of the college student information data, and a college student employment prediction model can be constructed based on the CART decision tree accordingly. The invention provides intelligent services for the employment management departments of colleges and universities, guides students to seek employment reasonably, helps improve the employment rate of college students, and can also provide intelligent position recommendation for recruitment platforms.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a general flowchart of a CART decision tree-based college student employment prediction method provided by the present invention;
FIG. 2 is a data fragment diagram of a student data information summary table;
FIG. 3 is a data fragment diagram after student data information summary table normalization processing;
FIG. 4 is a thermodynamic diagram of the correlation between student attributes of a data set;
FIG. 5 is a diagram of a decision tree model with employment area attributes as prediction targets;
FIG. 6 is the prediction tool page;
FIG. 7 is the prediction result display page.
Detailed Description
For further explanation of the details of the technical solutions of the present invention and their advantages, reference is now made to the detailed description of the embodiments taken in conjunction with the accompanying drawings.
Step S1: preprocessing of university student information data
1.1 college student data acquisition and integration
The raw data are collected from different administrative departments of the school, including the student basic information table, the per-semester student score tables and the student employment situation table. Each table contains many student attributes, such as student number, name, major, class, unit name and unit telephone, and the tables contain some repeated attributes, so the raw data need to be processed.
Different data attributes are first integrated into one data table to form the final standard data set. For example, the student basic information table, the student score table and the student employment situation table contain many identical attributes; the information in the different tables is merged using the attribute that uniquely identifies a student, namely the student number, and the duplicated attributes are deleted to form a college student data information summary table. Attributes potentially useful for employment tendency prediction are retained, while data such as the student number and mobile phone number are redundant, and these attributes are deleted during data preprocessing.
1.2 data normalization processing
In the data mining process, some attributes have a large number of different values; these attributes can be normalized so that their values fall into a limited and smaller value domain. For example, for the gender attribute, boys are labeled 1 and girls 0. For the place-of-origin attribute, according to the latest 2020 Chinese city tier classification, the place of origin is divided into five categories by the tier of the city of household registration, with first-, second-, third-, fourth- and fifth-tier cities represented by the numbers 1, 2, 3, 4 and 5 respectively. Other student attribute values are normalized in a similar manner.
The normalized "college student data information summary table" has c attributes such as major, place of origin, unit name, whether a student cadre, score, and so on. This set of basic college student attributes is recorded as N = {n_1, n_2, …, n_c}, where n_i is the i-th basic attribute and c is the number of basic attributes.
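As an illustrative sketch only (not part of the patent), this kind of encoding can be done with pandas; the column names and all code tables except the gender and city-tier rules stated above are assumptions.

import pandas as pd

# Hypothetical raw records; column names and values are assumptions for illustration.
raw = pd.DataFrame({
    "Gender": ["male", "female", "male"],
    "Origin": ["Beijing", "Xuzhou", "Luoyang"],
    "Leader": ["yes", "no", "yes"],
    "Score":  [88, 72, 91],
})

gender_code = {"male": 1, "female": 0}                   # coding stated in the text
city_tier   = {"Beijing": 1, "Xuzhou": 3, "Luoyang": 3}  # assumed city-tier lookup (1-5)
leader_code = {"yes": 1, "no": 0}                        # assumed binary coding

standardized = pd.DataFrame({
    "Gender": raw["Gender"].map(gender_code),
    "Origin": raw["Origin"].map(city_tier),
    "Leader": raw["Leader"].map(leader_code),
    "Score":  (raw["Score"] >= 80).astype(int),          # assumed score binarization rule
})
print(standardized)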
Step S2: determining relevant attributes affecting undergraduate employment prediction goals
The college student employment prediction target attribute can be the employment area, position, salary and so on, and the relevant student attributes differ for different prediction targets. The college student employment prediction target attribute set is written as Y = {y_1, y_2, …, y_|Y|}, where |Y| is the number of values of the prediction target attribute and y_u is a prediction target attribute value. For example, if the prediction target attribute Y is the employment area, the employment area has the attribute values y_1, y_2, y_3, y_4, y_5, corresponding to first-, second-, third-, fourth- and fifth-tier cities.
In order to find the student attributes that affect Y, the invention adopts the Pearson correlation coefficient method. Let cov(n_i, y_u) be the covariance of n_i and y_u, and let σ_{n_i} and σ_{y_u} be the standard deviations of n_i and y_u respectively. Then the correlation coefficient λ_{i,u} of n_i and y_u can be expressed as:

λ_{i,u} = cov(n_i, y_u) / (σ_{n_i} · σ_{y_u})    (1)
From the above formula, the value of λ_{i,u} always lies between −1.0 and 1.0; a coefficient close to 0 indicates that the variables are uncorrelated, while a coefficient close to 1 or −1 indicates a strong correlation.
The threshold is set to h. When λ_{i,u} is not less than h, n_i is defined as correlated with Y and is selected as a feature variable for constructing the employment prediction model; otherwise, n_i is not selected as a feature variable. The student attributes correlated with Y are recorded as X = {x_1, x_2, …, x_m}, where m is the number of feature variables and m ≤ c; each x_i takes K_i possible values, recorded as V(x_i) = {v_i^1, v_i^2, …, v_i^{K_i}}.
For example, the place-of-origin attribute is correlated with the employment area attribute of college students, and its values are first-, second-, third-, fourth- and fifth-tier cities, recorded as {1, 2, 3, 4, 5}.
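For illustration only (not part of the patent), step S2 can be sketched in Python roughly as follows; the data frame, column names and threshold h are assumptions, and the selection rule follows the λ_{i,u} ≥ h test described above.

import pandas as pd

def select_features(df: pd.DataFrame, target: str, h: float) -> list:
    """Return the columns whose Pearson correlation with the target is at least h."""
    feats = []
    for col in df.columns:
        if col == target:
            continue
        lam = df[col].corr(df[target])   # Pearson correlation coefficient, equation (1)
        if lam >= h:                     # rule as stated in the text; abs(lam) >= h is a common variant
            feats.append(col)
    return feats

# Tiny synthetic example (values are made up):
df = pd.DataFrame({
    "Origin":  [1, 3, 3, 2, 5, 1],
    "Gender":  [1, 0, 1, 0, 1, 0],
    "Score":   [1, 0, 1, 1, 0, 1],
    "Address": [1, 3, 3, 2, 4, 1],       # employment area, the prediction target
})
print(select_features(df, target="Address", h=0.5))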
Step S3 construction of university student employment prediction model based on CART decision tree
Let the number of records in the college student data information summary table be α, let S be the training set, and let the number of training records be r = α × u; the number of test records is then α − r. The training set is used to construct the employment prediction model and the test set is used to verify its accuracy; u is chosen according to the actual situation and is generally required to be larger than 70%.
The CART decision tree algorithm selects features with the Gini coefficient minimization criterion and generates a binary tree. Based on the value of x_i, the training set S can be divided into two parts: when x_i takes the value v_i^k, the subset is denoted S_i^{k1}; when x_i does not take the value v_i^k, the subset is denoted S_i^{k2}. S is thus divided into two parts, whose numbers of training samples are |S_i^{k1}| and |S_i^{k2}| respectively.
3.1 Determining the Gini coefficients of a student attribute on the training set
In the training set S, when x_i = v_i^k, the probability that Y takes the value y_u is denoted p_u^{k1}; when x_i ≠ v_i^k, the probability that Y takes the value y_u is denoted p_u^{k2}. Then the Gini coefficient of S_i^{k1} can be expressed as:

Gini(S_i^{k1}) = 1 − Σ_{u=1}^{|Y|} (p_u^{k1})²    (2)

Similarly, the Gini coefficient of S_i^{k2} can be expressed as:

Gini(S_i^{k2}) = 1 − Σ_{u=1}^{|Y|} (p_u^{k2})²    (3)

From Gini(S_i^{k1}) and Gini(S_i^{k2}) it can be seen that, for S, the Gini coefficient of splitting on x_i = v_i^k, recorded as Gini(S, x_i = v_i^k), can be expressed as:

Gini(S, x_i = v_i^k) = (|S_i^{k1}| / |S|) · Gini(S_i^{k1}) + (|S_i^{k2}| / |S|) · Gini(S_i^{k2})    (4)

The Gini coefficient characterizes the impurity of the attribute x_i: the smaller the Gini coefficient, the higher the purity and the stronger the discriminating power of the feature. Let the threshold of the Gini coefficient be l.
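As a sketch of how equations (2)–(4) translate into code (an illustration, not the patent's implementation; names such as gini_split and split_counts are assumed):

from collections import Counter

def gini(counts):
    """Gini coefficient of one subset, 1 - sum of squared class probabilities (eq. (2)/(3))."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    """Weighted Gini of the binary split S -> (S_i^k1, S_i^k2), equation (4)."""
    n1, n2 = sum(left_counts), sum(right_counts)
    return n1 / (n1 + n2) * gini(left_counts) + n2 / (n1 + n2) * gini(right_counts)

def split_counts(samples, labels, feature_index, value):
    """Class counts of the target Y on each side of the candidate cut point x_i == value."""
    left, right = Counter(), Counter()
    for x, y in zip(samples, labels):
        (left if x[feature_index] == value else right)[y] += 1
    return list(left.values()), list(right.values())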
3.2 college student employment decision tree construction algorithm
Inputting: s, X ═ X1,x2,…,xm},l,m
And (3) outputting: decision tree T
Step 1: computing
Figure BDA00034314579000000721
If it is not
Figure BDA00034314579000000722
T is a single node tree; otherwise, turning to Step 2;
step2 for
Figure BDA00034314579000000723
Solving for their minimum value, noting that the minimum value is
Figure BDA00034314579000000724
Get
Figure BDA00034314579000000725
Is a cutting point of a binary tree;
step3 according to x in SiWhether the value is equal to
Figure BDA00034314579000000726
Divide S into two subsets
Figure BDA00034314579000000727
And
Figure BDA00034314579000000728
and will be
Figure BDA00034314579000000729
And
Figure BDA00034314579000000730
distributing the data into two child nodes, if the child node has a damping coefficient less than l, the child node is a leaf node, ifIf the two child nodes are leaf nodes, returning to the decision tree T, otherwise, performing Step 4;
step4 for non-leaf nodes, respectively in order
Figure BDA0003431457900000081
And
Figure BDA0003431457900000082
order to
Figure BDA0003431457900000083
And recursively calling Step1 to Step4 to generate a binary decision tree T.
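Purely as an illustrative reading of Steps 1–4 (not the patent's own code), the recursion can be sketched as follows; it assumes the gini, gini_split and split_counts helpers from the sketch following section 3.1 are already defined, and the node representation and stopping details are assumptions where the text is unspecific.

from collections import Counter

def build_tree(samples, labels, features, l):
    majority = Counter(labels).most_common(1)[0][0]
    # Steps 1-2: compute Gini(S, x_i = v_i^k) for every feature/value pair
    splits = []
    for i in features:
        for v in set(x[i] for x in samples):
            left, right = split_counts(samples, labels, i, v)
            if left and right:                            # skip cut points that leave a side empty
                splits.append((gini_split(left, right), i, v))
    # Step 1: if no usable cut point remains, or every candidate Gini is below l, stop as a leaf
    if not splits or max(g for g, _, _ in splits) < l:
        return {"leaf": True, "label": majority}
    # Step 2: take the cut point with the minimum Gini coefficient
    _, i, v = min(splits, key=lambda t: t[0])
    # Step 3: divide S into the subsets with x_i == v and x_i != v
    eq = [(x, y) for x, y in zip(samples, labels) if x[i] == v]
    ne = [(x, y) for x, y in zip(samples, labels) if x[i] != v]
    # Step 4: recurse on each child; sufficiently pure children become leaves in their own call
    return {
        "leaf": False, "feature": i, "value": v,
        "eq": build_tree([x for x, _ in eq], [y for _, y in eq], features, l),
        "ne": build_tree([x for x, _ in ne], [y for _, y in ne], features, l),
    }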
After construction, the effectiveness of the built model needs to be evaluated. The test set contains α − r samples in total; the basic attributes of each student are used as the input of the CART decision tree prediction model, the outputs for the target attribute are collected and compared with the true target attribute values in the test set, and if b of them agree, the accuracy SR of the prediction model can be expressed as:

SR = b / (α − r)    (5)

If the accuracy exceeds a certain threshold, the constructed prediction model is valid.
Example analysis
1. Data pre-processing
The research object of this study is the employment-related data of the 2018 computer-major college students of a certain university.
Table 1 is the student basic information data table
Attribute name | Field name | Data type
Name | Name | Char(20)
Student number | ID | Char(20)
Gender | Gender | Char(20)
Political status | Politics | Char(20)
Place of origin | Origin | Char(20)
Whether a student cadre | Leader | Char(20)
Table 2 is the student achievement data table
Attribute name | Field name | Data type
Name | Name | Char(20)
Student number | ID | Char(20)
Major | Major | Char(20)
Class | Clbum | Char(20)
Score | Score | Char(20)
Table 3 is the student employment data table
Attribute name | Field name | Data type
Name | Name | Char(20)
Student number | ID | Char(20)
Gender | Gender | Char(20)
Employment situation | Job | Char(20)
Unit name | Firm | Char(20)
Unit address | Address | Char(20)
Job category | JobType | Char(20)
Unit type | FirmType | Char(20)
Unit telephone | Tel | Char(20)
The different kinds of student information come from different administrative departments of the school and are relatively scattered, so the received raw data need to be integrated. The different data are merged into a unified data table to form the final standard data set. The three data tables contain many identical attributes; the information in the different tables is merged using the uniquely identifying student attribute "student number", and the duplicated attributes are deleted to form the student data information summary table shown in Table 4.
Table 4 is the student data information summary table (rendered as an image in the original publication; it contains the 14 attributes obtained by merging Tables 1–3 on the student number).
The merged student data information table still contains redundant attributes, which are therefore deleted. The merged table has 14 attributes, of which the 6 attributes student number, name, major, class, unit name and unit telephone do not influence the employment tendency area of college students, so these 6 attributes are deleted. In addition, in order to describe the employment situation of students more intuitively and effectively, a "job matching" attribute is introduced to describe whether the major studied by the student matches the job category of the employment obtained. Finally, a student data information summary table containing 9 attributes is formed, as shown in Table 5.
Table 5 is the student data information summary table
Attribute name | Field name | Data type
Gender | Gender | Char(20)
Political status | Politics | Char(20)
Place of origin | Origin | Char(20)
Whether a student cadre | Leader | Char(20)
Score | Score | Char(20)
Employment situation | Job | Char(20)
Job category | JobType | Char(20)
Job matching | Match | Char(20)
Unit address | Address | Char(20)
According to the preprocessing of the university student data, a student data information summary table is obtained, and the data segments are shown in fig. 2.
In the data mining process, an attribute that has a large number of different values can hinder the mining; the values of such attributes can be normalized so that they fall into a limited and smaller value domain, which facilitates data analysis and mining and the generation of the decision tree. The 9 attributes involved are normalized as shown in Table 6.
TABLE 6 conversion and comparison table for attribute values
(The attribute value conversion table is rendered as an image in the original publication; it maps the values of the 9 attributes of Table 5 to the normalized numeric codes.)
Each attribute is converted according to the attribute value conversion table of Table 6 to form a standard data set; the converted data fragment is shown in FIG. 3.
2. Correlation analysis of employment attributes affecting college students
In order to establish a decision tree prediction model of the employment tendency areas of the college students, data mining needs to be carried out on information of the college students, the correlation between the basic attributes of the students and the attributes of the employment areas in the table 6 is determined, and then the basic attributes of the students with high correlation are determined.
Based on equation (1), the correlations between the basic student attributes and the prediction target attribute are calculated. FIG. 4 is a heat map of the correlations between the attributes of the data set, in which both the abscissa (from left to right) and the ordinate (from top to bottom) list "employment area", "place of origin", "whether a student cadre", "score", "job category", "job matching", "gender", "political status" and "employment situation". In the heat map, a darker color reflects a higher degree of correlation and a lighter color a lower degree, and the numerical value in each grid cell represents the degree of correlation between the two attributes of its coordinates.
Table 7 is the correlation interval table
Correlation coefficient interval | Meaning
0.8–1.0 | Very high correlation
0.5–0.8 | Moderate correlation
0.2–0.5 | Low correlation
0.0–0.2 | Very low or no correlation
From the analysis of FIG. 4 and Table 7 it can be seen that the correlations of "place of origin", "score", "job category", "gender", "job matching" and "whether a student cadre" with the college student "employment area" are relatively strong, so these six attributes are selected as the feature variables for constructing the prediction model of the college student employment area.
3. Construction of employment prediction model based on CART decision tree
The data set comprises 500 samples, of which 70%, i.e. 350 samples, form the training set. Each sample has 6 feature attributes, namely "place of origin", "score", "job category", "gender", "job matching" and "whether a student cadre", and 1 prediction target attribute, "unit address" (employment area).
The value set of "place of origin" is {1, 2, 3, 4, 5}; the value sets of "score", "job category", "gender", "job matching" and "whether a student cadre" are each {0, 1}; and the value set of "unit address" is {1, 2, 3, 4, 5}.
In the following, the attribute "place of origin" is taken as an example, and the Gini coefficients of all of its possible cut points are solved.
The given Gini coefficient threshold is 0.255; if the Gini coefficient of a node is smaller than this threshold, the node is a leaf node. For the attribute "place of origin", the number of samples with value "1" is 8, with value "2" is 97, with value "3" is 185, with value "4" is 50, and with value "5" is 10. For the attribute "unit address", the number of samples with value "1" is 51, with value "2" is 179, with value "3" is 81, with value "4" is 25, and with value "5" is 15. When the attribute "place of origin" takes the value "1", the numbers of corresponding samples whose prediction attribute "unit address" takes the values "1", "2", "3", "4" and "5" are 5, 3, 0, 0 and 0 respectively.
Based on equations (2)–(4), the Gini coefficient of the cut point "place of origin = 1" is calculated (the intermediate expressions are rendered as images in the original publication).
Repeating these steps gives:
Gini(D, place of origin = 2) = 0.561
Gini(D, place of origin = 3) = 0.512
Gini(D, place of origin = 4) = 0.533
Gini(D, place of origin = 5) = 0.568
From the analysis of the calculation results, since Gini(D, place of origin = 3) = 0.512 is the smallest, "place of origin = 3" is selected as the optimal cut point of the attribute "place of origin".
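As an illustrative cross-check only (not from the patent), equations (2)–(4) can be applied directly to the counts reported above for the cut point "place of origin = 1"; the class distribution of the complementary subset is inferred here by subtraction from the reported totals, so the result is merely indicative and need not reproduce the value shown in the original image.

d1  = [5, 3, 0, 0, 0]                    # 'unit address' counts when place of origin == 1 (given above)
tot = [51, 179, 81, 25, 15]              # 'unit address' counts over the whole training set (given above)
d2  = [t - a for t, a in zip(tot, d1)]   # inferred counts when place of origin != 1

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

n1, n2 = sum(d1), sum(d2)
weighted = n1 / (n1 + n2) * gini(d1) + n2 / (n1 + n2) * gini(d2)
print(round(weighted, 3))                # weighted Gini of the candidate cut point, equation (4)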
The attributes "score", "job category", "gender", "job matching" and "whether a student cadre" are all binary and need no further segmentation. Their respective Gini coefficients are:
Gini(D, score = 0) = 0.587
Gini(D, job category = 0) = 0.625
Gini(D, gender = 0) = 0.598
Gini(D, job matching = 0) = 0.621
Gini(D, whether a student cadre = 0) = 0.579
From the analysis of the above calculation results, since Gini(D, place of origin = 3) = 0.512 is the minimum, "place of origin" is selected as the optimal feature and "place of origin = 3" as its optimal cut point; that is, the attribute "place of origin" is chosen as the root node and "place of origin = 3" as the optimal cut point of the root node. The root node generates two child nodes, one of which is a leaf node; for the other child node, the optimal feature and its optimal cut point are selected from "score", "job category", "gender", "job matching" and "whether a student cadre" using the same method, until the construction of the college student employment tendency area decision tree model is completed. The constructed decision tree prediction model of the college student employment area is shown in FIG. 5.
The effectiveness of the created prediction model is evaluated below. In this example, the number of training samples is 350 and the number of test samples is 150. Based on the decision tree prediction model of the college student employment area established above, the accuracy calculated from equation (5) is 67.67%.
The method is applied to actual undergraduate employment area prediction, a prediction tool page is shown in figure 6, and a prediction result display page is shown in figure 7.

Claims (5)

1. A college student employment prediction method based on a CART decision tree, characterized by comprising the following steps:
S1: preprocessing the information data of college students;
collecting the raw data of college students, constructing a basic student attribute set, and standardizing each data item to form a standardized data set, wherein the basic college student attribute set is recorded as N = {n_1, n_2, …, n_c}, in which n_i is the i-th basic attribute and c is the number of basic attributes;
S2: determining the relevant attributes influencing the college student employment prediction target;
setting the college student employment prediction target attribute set as Y = {y_1, y_2, …, y_|Y|}, where |Y| is the number of values of the prediction target attribute and y_u is a prediction target attribute value; letting the Pearson correlation coefficient between element n_i of N and element y_u of Y be λ_{i,u};
setting the Pearson correlation coefficient threshold as h: when λ_{i,u} is not less than h, n_i is defined as correlated with Y; otherwise n_i is defined as uncorrelated with Y; on this basis the basic student attributes correlated with Y are collected, and the relevant attributes influencing the employment prediction target Y are recorded as the feature vector X = {x_1, x_2, …, x_m}, where m is the number of feature variables and m ≤ c; each x_i takes K_i possible values, recorded as V(x_i) = {v_i^1, v_i^2, …, v_i^{K_i}};
S3: constructing a university student employment prediction model based on the CART decision tree;
setting the basic attribute data of college students as α groups, of which r groups form the training set S and the remaining α − r groups form the test set; the training set S is used to construct the employment prediction model, and the test set is used to verify the accuracy of the employment prediction model;
calculating the Gini coefficient Gini(S, x_i = v_i^k) in the training set S for each feature x_i and each of its values v_i^k; solving the Gini coefficients of the basic student attributes in the training set S in this way, setting the Gini coefficient threshold as l, and then constructing the college student employment decision tree, i.e. the employment prediction model, based on Gini(S, x_i = v_i^k).
2. The CART decision tree-based college student employment prediction method according to claim 1, wherein in step S2 the Pearson correlation coefficient λ_{i,u} of Y and N is calculated as:

λ_{i,u} = cov(n_i, y_u) / (σ_{n_i} · σ_{y_u})

where cov(n_i, y_u) is the covariance of n_i and y_u, and σ_{n_i} and σ_{y_u} are the standard deviations of n_i and y_u respectively.
3. The CART decision tree-based college student employment prediction method according to claim 1, wherein: in step S3, 70% of the data is set as the training set, and 30% of the data is set as the test set.
4. The CART decision tree-based college student employment prediction method according to claim 1 or 3, wherein in step S3 the method for solving the Gini coefficients of the basic student attributes in the training set S is as follows:
when x_i takes the value v_i^k, the corresponding subset is recorded as S_i^{k1}; when x_i does not take the value v_i^k, the corresponding subset is recorded as S_i^{k2}; S can thereby be divided into the two parts S_i^{k1} and S_i^{k2}, whose numbers of training samples are |S_i^{k1}| and |S_i^{k2}| respectively;
in S, when x_i = v_i^k, the probability that Y takes the value y_u is p_u^{k1}; when x_i ≠ v_i^k, the probability that Y takes the value y_u is p_u^{k2}; then the Gini coefficient of S_i^{k1} can be expressed as:

Gini(S_i^{k1}) = 1 − Σ_{u=1}^{|Y|} (p_u^{k1})²

similarly, the Gini coefficient of S_i^{k2} can be expressed as:

Gini(S_i^{k2}) = 1 − Σ_{u=1}^{|Y|} (p_u^{k2})²

from Gini(S_i^{k1}) and Gini(S_i^{k2}) it can be seen that, for S, the Gini coefficient of taking V(x_i) = v_i^k, recorded as Gini(S, x_i = v_i^k), can be expressed as:

Gini(S, x_i = v_i^k) = (|S_i^{k1}| / |S|) · Gini(S_i^{k1}) + (|S_i^{k2}| / |S|) · Gini(S_i^{k2})
5. The CART decision tree-based college student employment prediction method according to claim 1 or 3, wherein in step S3 the method for constructing the college student employment CART decision tree based on Gini(S, x_i = v_i^k) is as follows:
let the threshold of the Gini coefficient be l;
Input: S, X = {x_1, x_2, …, x_m}, l, m;
Output: decision tree T;
Step 1: compute Gini(S, x_i = v_i^k) for each feature x_i and each of its values v_i^k; if these Gini coefficients are all less than l, T is a single-node tree; otherwise go to Step 2;
Step 2: among the values Gini(S, x_i = v_i^k), find the minimum, recorded as Gini(S, x_i = v_i^{k*}), and take x_i = v_i^{k*} as the cut point of the binary tree;
Step 3: according to whether the value of x_i in S equals v_i^{k*}, divide S into the two subsets S_i^{k1} and S_i^{k2} and assign them to two child nodes; if the Gini coefficient of a child node is less than l, that child node is a leaf node; if both child nodes are leaf nodes, return the decision tree T, otherwise go to Step 4;
Step 4: for each non-leaf child node, let S = S_i^{k1} or S = S_i^{k2} respectively, and recursively call Step 1 to Step 4 to generate the binary decision tree T.
CN202111608264.1A 2021-12-24 2021-12-24 University student employment prediction method based on CART decision tree Pending CN114330716A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111608264.1A CN114330716A (en) 2021-12-24 2021-12-24 University student employment prediction method based on CART decision tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111608264.1A CN114330716A (en) 2021-12-24 2021-12-24 University student employment prediction method based on CART decision tree

Publications (1)

Publication Number Publication Date
CN114330716A true CN114330716A (en) 2022-04-12

Family

ID=81014001

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111608264.1A Pending CN114330716A (en) 2021-12-24 2021-12-24 University student employment prediction method based on CART decision tree

Country Status (1)

Country Link
CN (1) CN114330716A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116029379A (en) * 2022-12-31 2023-04-28 中国电子科技集团公司信息科学研究院 Method for constructing air target intention recognition model
CN116029379B (en) * 2022-12-31 2024-01-02 中国电子科技集团公司信息科学研究院 Method for constructing air target intention recognition model
CN116563067A (en) * 2023-05-15 2023-08-08 北京融信数联科技有限公司 Big data-based graduate crowd employment analysis method, system and medium
CN116563067B (en) * 2023-05-15 2024-03-15 北京融信数联科技有限公司 Big data-based graduate crowd employment analysis method, system and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination