CN111415099A

CN111415099A - Method for identifying impoverished students based on multi-class BP-Adaboost

Info

Publication number: CN111415099A
Application number: CN202010236492.XA
Authority: CN
Inventors: 杨建锋; 魏瀚哲; 王朝阳
Original assignee: Northwestern University
Current assignee: Northwestern University
Priority date: 2020-03-30
Filing date: 2020-03-30
Publication date: 2020-07-14

Abstract

A method for identifying impoverished students based on multi-class BP-Adaboost comprises the following steps: (1) acquiring multi-dimensional historical data of impoverished students from previous years; (2) preprocessing the acquired historical data to construct a student feature matrix S; (3) dividing the multi-dimensional historical data into three categories according to the degree of poverty, attaching a poverty category label to each student, and constructing a training data set; (4) designing a BP-Adaboost classification model and training it with the data set built from the feature matrices of previous years' students at each poverty level; (5) using the trained model to assist the identification of impoverished students. The invention uses behavior data generated by students on campus to design a BP-Adaboost-based multi-class model that can quickly and accurately assign students to one of three poverty categories, and the resulting judgment of each student's poverty status assists university financial aid administrators in making decisions.

Description

Method for identifying impoverished students based on multi-class BP-Adaboost

Technical Field

The invention belongs to the technical field of feature extraction and classification algorithms, and particularly relates to a method for identifying impoverished students based on multi-class BP-Adaboost.

Background

Student financial aid is an important part of, and an important measure for, poverty alleviation, promoting fairness in education, and thereby achieving social fairness. Identifying impoverished students in colleges and universities is the basic work for effectively implementing national student financial aid policies and is essential to making that aid precise. At present, most colleges and universities identify impoverished students through class-level public appraisal and review by college counselors, after the student presents supporting documents issued by the town or village where the student lives. This identification mode suffers from deviations in identification, from personal subjective impressions easily entering each stage of the appraisal, and from impoverished students giving up the appraisal out of self-esteem, all of which affect the fairness, efficiency and accuracy of identifying impoverished students.

With the arrival of the big data era and the growing maturity of advanced learning methods, new ideas and technical support are available for financial aid work, and colleges and universities have a new opportunity to use big data and advanced learning methods to make that work fast, convenient, efficient and accurate. The informatization of colleges and universities has advanced considerably: every activity of a student on campus generates data that record the student's characteristics and reflect the student's real circumstances. Applied sensibly, these data can assist the identification of impoverished students to a certain extent, make the identification result more truthful and objective, and direct more help to students who are genuinely impoverished.

At present, the use of big data and machine learning to assist the identification of impoverished students in colleges and universities is still at an exploratory stage, and there is no unified identification method in China. Although some existing techniques offer useful ideas, they either fail to meet the needs of practical application or are difficult to implement. For example, the patent application with application number 201810972342.8, entitled 'a student poverty degree prediction method based on machine learning', predicts the degree of student poverty from behavior data generated by students at school, but it requires dozens of types of data, which easily leads to the curse of dimensionality and increases the difficulty of implementation.

Therefore, identifying impoverished students effectively and accurately by means of big data and machine learning has become the key to research on assisting precise financial aid for impoverished students.

Disclosure of Invention

In order to solve the problems in the prior art that high-dimensional data on impoverished students are difficult to process and that impoverished students are difficult to subsidize accurately, the invention provides a method for identifying impoverished students based on multi-class BP-Adaboost.

In order to achieve the purpose, the invention adopts the technical scheme that:

a method for identifying impoverished students based on multi-class BP-Adaboost, characterized by comprising the following steps:

step 1, obtaining historical behavior data of students and obtaining multi-dimensional historical data of impoverished students from previous years, wherein the multi-dimensional historical data comprise the students' family situation and economic situation, campus consumption, academic achievement, and basic information as impoverished students;

the specific steps of acquiring the multi-dimensional historical data and establishing the feature matrix of impoverished students are as follows:

1) extracting the student's family situation and economic situation, including whether the student is a solitary child, whether the student is an orphan, whether the family is a registered impoverished household, whether the student is disabled or ill, whether a parent is disabled or ill, whether the student is a person in extreme hardship supported in urban or rural areas, and whether the family receives the urban or rural minimum living allowance; extracting campus consumption, including total consumption amount, maximum daily consumption amount, average daily consumption amount, maximum monthly consumption amount, and average number of daily transactions; extracting academic achievement, including grade point average, average semester score, and number of failed courses; extracting the basic situation as an impoverished student, including whether the student enrolled through the green channel and whether the student has taken out a student loan from the place of origin;

2) let the student family and economic situation data set be $E = \{e_1, e_2, \dots, e_n\}$, where n denotes the student number and $e_n$ is a matrix composed of whether the student is a solitary child, whether the student is an orphan, whether the family is a registered impoverished household, whether the student is the child of a martyr or of a family entitled to preferential treatment, whether the student is disabled or ill, whether a parent is disabled or ill, whether the student is a person in extreme hardship supported in urban or rural areas, and whether the family receives the urban or rural minimum living allowance;

3) let the campus consumption data set be $C = \{c_1, c_2, \dots, c_n\}$, where n denotes the student number and $c_n$ is a matrix composed of total consumption amount, maximum daily consumption amount, average daily consumption amount, maximum monthly consumption amount, and average number of daily transactions;

4) let the student achievement data set be $G = \{g_1, g_2, \dots, g_n\}$, where n denotes the student number and $g_n$ is a matrix composed of grade point average, average semester score, and number of failed courses;

5) let the basic situation data set of impoverished students be $B = \{b_1, b_2, \dots, b_n\}$, where n denotes the student number and $b_n$ is a matrix composed of whether the student enrolled through the green channel and whether the student has taken out a student loan from the place of origin;
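The four data sets above can be pictured as one record per student. The following is a minimal Python sketch of that layout, assuming one column per yes/no item as listed in sub-step 1); the class and field names are hypothetical and simply follow the wording of the text, and the source does not prescribe any particular data structure.

```python
from dataclasses import dataclass, astuple

@dataclass
class FamilyEconomic:            # one entry e_i of data set E (yes/no items)
    solitary_child: bool         # wording taken from the source text
    orphan: bool
    registered_poor_household: bool
    student_disabled_or_ill: bool
    parent_disabled_or_ill: bool
    extreme_hardship_support: bool
    minimum_living_allowance: bool

@dataclass
class Consumption:               # one entry c_i of data set C
    total: float
    max_daily: float
    avg_daily: float
    max_monthly: float
    avg_daily_transactions: float

@dataclass
class Achievement:               # one entry g_i of data set G
    gpa: float
    semester_average: float
    failed_courses: int

@dataclass
class BasicInfo:                 # one entry b_i of data set B
    green_channel: bool
    origin_student_loan: bool

def raw_feature_vector(e, c, g, b):
    """Flatten the four records of one student into a single list of floats."""
    return [float(v) for part in (e, c, g, b) for v in astuple(part)]

example = raw_feature_vector(
    FamilyEconomic(False, False, True, False, True, False, True),
    Consumption(5230.5, 86.0, 21.4, 910.0, 2.3),
    Achievement(3.2, 81.5, 0),
    BasicInfo(True, False),
)
print(len(example))  # 7 + 5 + 3 + 2 columns
```

Under this one-column-per-item assumption the vector has 17 entries, which happens to agree with the 17 input nodes mentioned later for the BP network; the source does not state that mapping explicitly.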

step 2, preprocessing the multi-dimensional historical data of impoverished students from previous years collected in step 1; the specific steps are as follows:

1) processing missing values in the data set; missing values cause the data to lose part of their information, and the missing fields are filled with the average value;

2) removing duplicated data: sorting the data of previous years' impoverished students by student number, detecting whether records are duplicated by comparing whether adjacent records are similar, and deleting the duplicated records;

3) feature-coding the student family and economic situation data set E and the basic situation data set B of impoverished students, using one-hot coding;

4) normalization, namely normalizing the campus consumption data set C and the student achievement data set G by using a Sigmoid function;

5) merging the student family and economic situation data set E, the normalized campus consumption data set, the normalized student achievement data set, and the basic situation data set B of impoverished students into a student feature matrix S;
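Sub-steps 1) and 2) of the preprocessing can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: records are dictionaries, a missing value is represented by `None`, "similar" adjacent records are taken to mean exactly equal records, and the helper names and the `student_id` key are hypothetical.

```python
def fill_missing_with_mean(rows, columns):
    """Replace None in the given numeric columns with that column's mean."""
    for col in columns:
        present = [r[col] for r in rows if r[col] is not None]
        mean = sum(present) / len(present) if present else 0.0
        for r in rows:
            if r[col] is None:
                r[col] = mean
    return rows

def drop_adjacent_duplicates(rows, key="student_id"):
    """Sort by student number, then drop any record equal to its predecessor."""
    ordered = sorted(rows, key=lambda r: r[key])
    kept = []
    for r in ordered:
        if not kept or r != kept[-1]:
            kept.append(r)
    return kept

records = [
    {"student_id": 2, "total": 900.0, "gpa": 3.1},
    {"student_id": 1, "total": None, "gpa": 2.5},
    {"student_id": 2, "total": 900.0, "gpa": 3.1},   # exact duplicate
    {"student_id": 3, "total": 1500.0, "gpa": None},
]
clean = drop_adjacent_duplicates(fill_missing_with_mean(records, ["total", "gpa"]))
for row in clean:
    print(row)
```

The order follows the text (fill first, then deduplicate), so duplicated records still contribute to the column mean; an implementation that wants the mean over unique students would swap the two calls.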

step 3, dividing the multi-dimensional historical data of impoverished students from previous years into three categories according to the degree of poverty, attaching a poverty category label to each student, and constructing a training data set, the specific steps being as follows:

classifying the students into three levels according to the poverty grades of previous years, namely not impoverished, generally impoverished and especially impoverished, and using one-hot coding as the poverty category label of each student to construct a training data set $T = \{(x_1, y_1), \dots, (x_i, y_i), \dots, (x_n, y_n)\}$, where the input data $x_i$ are randomly extracted from the student feature matrix S, the label $y_i \in \{001, 010, 011\}$, where 001, 010 and 011 correspond to not impoverished, generally impoverished and especially impoverished respectively, n is the amount of data, and the amount of data in T is 70% of the student feature matrix;
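The random 70% extraction described above can be sketched as follows. The function name, the fixed seed and the use of integer class indices 1, 2, 3 are assumptions made for illustration; the codes 001, 010 and 011 are reproduced exactly as the text lists them.

```python
import random

# Label codes exactly as listed in the text, keyed by an integer class index.
CODES = {1: "001", 2: "010", 3: "011"}

def make_training_set(S, y, train_fraction=0.7, seed=0):
    """Randomly draw train_fraction of the rows of S (with labels) as T."""
    idx = list(range(len(S)))
    random.Random(seed).shuffle(idx)
    n_train = int(round(train_fraction * len(S)))
    T = [(S[i], y[i]) for i in idx[:n_train]]
    held_out = [(S[i], y[i]) for i in idx[n_train:]]
    return T, held_out

# Toy feature matrix with 10 students and 2 columns, purely for illustration.
S = [[float(i), float(i % 3)] for i in range(10)]
y = [1 + (i % 3) for i in range(10)]
T, held_out = make_training_set(S, y)
print(len(T), len(held_out))
print([CODES[label] for _, label in T])
```

The text does not say what the remaining 30% is used for; keeping it as a held-out set for checking the trained model is one common choice.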

step 4, designing a BP-Adaboost classification model, and training the BP-Adaboost classification model with the data set constructed from the feature matrices, extracted in step 1, of previous years' students at each poverty level, specifically comprising the following steps:

1) inputting the training data set T and initializing the weights of the training data as $D_1 = (w_{1,1}, \dots, w_{1,i}, \dots, w_{1,N})$, where $w_{1,i} = 1/N$, $i = 1, 2, \dots, N$, and N represents the amount of data in the student feature matrix S; at the same time, setting the iteration counter $m = 1$ and the total number of iterations to M, where $M = 10$;

2) starting the iteration with a three-layer neural network, wherein the neural network is a BP neural network comprising an input layer, a hidden layer and an output layer, the input layer having 17 nodes, the hidden layer 18 nodes, and the output layer 3 nodes;

3) training on the training data set with its weight distribution to obtain a weak classifier $G_m(x): X \to \{001, 010, 011\}$, where 001, 010 and 011 correspond to not impoverished, generally impoverished and especially impoverished, respectively;

4) calculating the error rate $\mathrm{err}_m$ of the current classifier $G_m(x)$ on the training data:

5) calculating the coefficient $\alpha_m$ of $G_m(x)$:

where K represents the categories of poverty and $\alpha_m$ represents the importance of $G_m(x)$ in the final classifier; $\alpha_m$ increases as $\mathrm{err}_m$ decreases, i.e. the smaller the classification error rate, the greater the contribution of the classifier to the final classifier;

6) updating the weight distribution of the training data set:

$D_{m+1} = (w_{m+1,1}, \dots, w_{m+1,i}, \dots, w_{m+1,N})$,

where $w_{m+1,i}$ can be written as the following formula:

it follows that the weights of the samples misclassified by the basic classifier $G_m(x)$ are enlarged and the weights of the correctly classified samples are reduced, so that the BP-Adaboost classification model pays more attention to the misclassified samples, which play a greater role in the next round of learning, thereby improving the classification capability of the model;

$Z_m$ is a normalization factor:

which makes $D_{m+1}$ a probability distribution;

7) judging whether to terminate the iteration: when $m < M$, setting $m = m + 1$, returning to sub-step 3) and continuing to iterate; otherwise, terminating the iteration, completing the training of the BP-Adaboost classifier, and obtaining the final classifier
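The formulas that sub-steps 4) to 7) refer to are not reproduced in this text. The surrounding description (a coefficient that depends on the categories K and grows as the error rate falls, a normalization factor $Z_m$, and larger weights for misclassified samples) is consistent with the standard multi-class AdaBoost variant known as SAMME. Purely as an assumption, and not as a statement of the patent's exact formulas, the SAMME forms with K taken as the number of classes would read:

```latex
\begin{align}
\mathrm{err}_m &= \sum_{i=1}^{N} w_{m,i}\, I\bigl(G_m(x_i) \neq y_i\bigr) \\
\alpha_m &= \ln\frac{1-\mathrm{err}_m}{\mathrm{err}_m} + \ln(K-1) \\
w_{m+1,i} &= \frac{w_{m,i}}{Z_m}\,\exp\bigl(\alpha_m\, I(G_m(x_i) \neq y_i)\bigr) \\
Z_m &= \sum_{i=1}^{N} w_{m,i}\,\exp\bigl(\alpha_m\, I(G_m(x_i) \neq y_i)\bigr) \\
G(x) &= \arg\max_{k} \sum_{m=1}^{M} \alpha_m\, I\bigl(G_m(x) = k\bigr)
\end{align}
```

Here $I(\cdot)$ is the indicator function and the weights $w_{m,i}$ are assumed to sum to 1. In this form $\alpha_m$ is positive whenever $\mathrm{err}_m < 1 - 1/K$, so with $K = 3$ a weak classifier only needs to beat an error rate of 2/3 to contribute positively.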

Step 5, using the trained model to assist the identification of impoverished students, specifically comprising the following steps:

1) extracting the family situation and economic situation of the student to be identified, including whether the student is a solitary child, whether the student is an orphan, whether the family is a registered impoverished household, whether the student is disabled or ill, whether a parent is disabled or ill, whether the student is a person in extreme hardship supported in urban or rural areas, and whether the family receives the urban or rural minimum living allowance; extracting campus consumption, including total consumption amount, maximum daily consumption amount, average daily consumption amount, maximum monthly consumption amount, and average number of daily transactions; extracting academic achievement, including grade point average, average semester score, and number of failed courses; extracting the basic situation as an impoverished student, including whether the student enrolled through the green channel and whether the student has taken out a student loan from the place of origin;

2) preprocessing the acquired student data and constructing a student feature matrix S;

3) inputting the student feature matrix S to be classified into the trained BP-Adaboost classification model to obtain the identification result: if the output is 1, the student is not impoverished; if the output is 2, the student is generally impoverished; and if the output is 3, the student is especially impoverished.
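How the trained model turns one feature vector into the output 1, 2 or 3 can be sketched as a weighted vote over the weak classifiers. The voting rule below is the usual AdaBoost combination and is an assumption, since the final-classifier formula is not reproduced in this text; the three stand-in weak classifiers and their coefficients are hypothetical values chosen only to make the example run.

```python
NAMES = {
    1: "not impoverished",
    2: "generally impoverished",
    3: "especially impoverished",
}

def final_classify(x, weak_classifiers, alphas, classes=(1, 2, 3)):
    """Return the class whose supporting weak classifiers have the largest total alpha."""
    votes = {k: 0.0 for k in classes}
    for G_m, alpha_m in zip(weak_classifiers, alphas):
        votes[G_m(x)] += alpha_m
    # On a tie, the class listed first in `classes` wins.
    return max(classes, key=lambda k: votes[k])

# Hypothetical stand-ins for trained BP networks: each always answers one class.
weak = [lambda x: 2, lambda x: 3, lambda x: 2]
alphas = [0.8, 1.1, 0.6]

student = [0.0] * 17          # one preprocessed feature vector
result = final_classify(student, weak, alphas)
print(result, NAMES[result])
```

In this toy case class 2 collects 0.8 + 0.6 = 1.4 against 1.1 for class 3, so the single most confident weak classifier is outvoted.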

The campus consumption data set C and the student achievement data set G are normalized by using a Sigmoid function; the specific steps are as follows:

1) normalizing each item of data in the campus consumption data set C with the Sigmoid function to obtain the normalized student campus consumption data, which together form the normalized campus consumption data set;

2) normalizing each item of data in the student achievement data set G with the Sigmoid function to obtain the normalized student achievement data, which together form the normalized student achievement data set.
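The Sigmoid (logistic) function is $\sigma(x) = 1 / (1 + e^{-x})$. A minimal sketch of applying it element-wise to a data set such as C or G is given below; the function names and the sample values are hypothetical, and whether the original method rescales the raw values before applying the function is not stated in the text.

```python
import math

def sigmoid(x):
    """Logistic function; the two branches avoid overflow in math.exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def normalize_dataset(rows):
    """Apply the Sigmoid function to every item of every record."""
    return [[sigmoid(v) for v in row] for row in rows]

# Two illustrative records of G: [grade point average, semester average, failed courses]
G = [[3.2, 81.5, 0.0], [2.1, 67.0, 2.0]]
G_norm = normalize_dataset(G)
for row in G_norm:
    print(row)
```

One practical observation: large raw values such as a semester average of 81.5 or a consumption total in the thousands map to values indistinguishable from 1, so an implementation may want to centre or rescale each column before applying the function.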

Compared with the prior art, the invention has the following advantages and beneficial effects:

The invention provides a method for identifying impoverished students based on multi-class BP-Adaboost, which changes the traditional identification mode and, by adopting a machine learning method in the identification process, overcomes human subjectivity; compared with existing machine-learning methods for identifying impoverished students, the method selects the key factors in the identification, reduces the dimensionality of the student data, and avoids the curse of dimensionality in machine learning; the method uses BP-Adaboost as the classifier, which has higher classification precision and effectively improves the accuracy of identifying impoverished students.

Drawings

FIG. 1 is a general flow diagram of the present invention;

FIG. 2 is a flowchart of the training of the BP-Adaboost classification model.

Detailed Description

The present invention will be further described with reference to the following embodiments and the accompanying drawings, but the present invention is not limited to the following embodiments.

A method for identifying impoverished students based on multi-class BP-Adaboost comprises the following steps:

step (1): collecting historical data of impoverished students from previous years; the multi-dimensional historical data comprise the students' family situation, economic situation, campus consumption, academic achievement and basic information as impoverished students, and a feature matrix of previous years' impoverished students is established; because the classification model of the invention is built on the features of the impoverished-student data, an accurate selection of the basic data lays the foundation for accurate subsequent classification; the specific steps are (1.1) to (1.6) as follows:

(1.1) extracting the student's family situation and economic situation, including whether the student is a solitary child, whether the student is an orphan, whether the family is a registered impoverished household, whether the student is disabled or ill, whether a parent is disabled or ill, whether the student is a person in extreme hardship supported in urban or rural areas, and whether the family receives the urban or rural minimum living allowance; extracting campus consumption, including total consumption amount, maximum daily consumption amount, average daily consumption amount, maximum monthly consumption amount, and average number of daily transactions; extracting academic achievement, including grade point average, average semester score, and number of failed courses; extracting the basic situation as an impoverished student, including whether the student enrolled through the green channel and whether the student has taken out a student loan from the place of origin;

(1.2) let the student family and economic situation data set be $E = \{e_1, e_2, \dots, e_n\}$, where n denotes the student number and $e_n$ is a matrix composed of whether the student is a solitary child, whether the student is an orphan, whether the family is a registered impoverished household, whether the student is the child of a martyr or of a family entitled to preferential treatment, whether the student is disabled or ill, whether a parent is disabled or ill, whether the student is a person in extreme hardship supported in urban or rural areas, and whether the family receives the urban or rural minimum living allowance, thereby establishing the student family and economic situation data set E;

(1.3) let the campus consumption data set be $C = \{c_1, c_2, \dots, c_n\}$, where n denotes the student number and $c_n$ is a matrix composed of total consumption amount, maximum daily consumption amount, average daily consumption amount, maximum monthly consumption amount, and average number of daily transactions, thereby establishing the campus consumption data set C;

(1.4) let the student achievement data set be $G = \{g_1, g_2, \dots, g_n\}$, where n denotes the student number and $g_n$ is a matrix composed of grade point average, average semester score, and number of failed courses, thereby establishing the student achievement data set G;

(1.5) let the basic situation data set of impoverished students be $B = \{b_1, b_2, \dots, b_n\}$, where n denotes the student number and $b_n$ is a matrix composed of whether the student enrolled through the green channel and whether the student has taken out a student loan, thereby establishing the basic situation data set B of impoverished students;

step (2): data obtained in practice often contain missing and duplicated values; for example, student consumption information may be missing because of a faulty card reader in a school canteen, so the data must be preprocessed before use; preprocessing has no standard procedure, and the preprocessing designed here covers only the procedure involved in the invention, as described in steps (2.1) to (2.5):

(2.1) processing missing values in the data set; missing values cause the data to lose part of their information, and some models with poor robustness cannot process data that contain them; the campus consumption data and student achievement data involved in the method may have missing entries because of the acquisition equipment or for other reasons, and the missing fields are filled with the average value;

(2.2) removing duplicated data: sorting the data of previous years' impoverished students by student number, detecting whether records are duplicated by comparing whether adjacent records are similar, and deleting the duplicated records;

(2.3) feature-coding the student family and economic situation data set E and the basic situation data set B of impoverished students, using one-hot coding;

(2.4) data normalization adjusts the attribute values by scaling the data so that they fall into a small specific interval; in the specific implementation, the campus consumption data set C and the student achievement data set G are normalized with the Sigmoid function, as described in steps (2.4.1) and (2.4.2):

(2.4.1) normalizing each item of data in the campus consumption data set C with the Sigmoid function to obtain the normalized student campus consumption data, which together form the normalized campus consumption data set;

(2.4.2) normalizing each item of data in the student achievement data set G with the Sigmoid function to obtain the normalized student achievement data, which together form the normalized student achievement data set;

(2.5) merging the student family and economic situation data set E, the normalized campus consumption data set, the normalized student achievement data set, and the basic situation data set B of impoverished students into a student feature matrix S;
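The coding and merging steps, (2.3) and (2.5), can be sketched as a row-wise concatenation. One detail needs an explicit assumption: a two-column one-hot code per yes/no item would give more than 17 columns, whereas the network described later has 17 input nodes, which matches one 0/1 column per item. The sketch therefore codes each yes/no item as a single 0/1 column; the function names and sample values are hypothetical.

```python
def encode_flags(flags):
    """Code each yes/no item as one column: 1.0 for yes, 0.0 for no."""
    return [1.0 if f else 0.0 for f in flags]

def merge_feature_matrix(E, C_norm, G_norm, B):
    """Concatenate the four per-student blocks, row by row, into S."""
    if not (len(E) == len(C_norm) == len(G_norm) == len(B)):
        raise ValueError("all data sets must cover the same students")
    return [
        encode_flags(e) + list(c) + list(g) + encode_flags(b)
        for e, c, g, b in zip(E, C_norm, G_norm, B)
    ]

# Two illustrative students; C_norm and G_norm are assumed already normalized.
E = [[False, False, True, False, True, False, True],
     [False, False, False, False, False, False, False]]
C_norm = [[0.91, 0.62, 0.55, 0.80, 0.58], [0.99, 0.88, 0.71, 0.95, 0.66]]
G_norm = [[0.72, 0.64, 0.50], [0.69, 0.60, 0.73]]
B = [[True, True], [False, False]]

S = merge_feature_matrix(E, C_norm, G_norm, B)
print(len(S), len(S[0]))
```

Each row of S then has 7 + 5 + 3 + 2 = 17 columns, in the order E, C, G, B.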

step (3): dividing the impoverished-student data in the student feature matrix S into three classes according to the national standard for financial aid to impoverished students, namely not impoverished, generally impoverished and especially impoverished, and using one-hot coding as the poverty category label of each student to construct a training data set $T = \{(x_1, y_1), \dots, (x_i, y_i), \dots, (x_n, y_n)\}$, where the input data $x_i$ are randomly extracted from the student feature matrix S, the label $y_i \in \{001, 010, 011\}$, where 001, 010 and 011 correspond to not impoverished, generally impoverished and especially impoverished respectively, n is the amount of data, and the amount of data in T is 70% of the student feature matrix;

step (4): as shown in FIG. 2, designing the BP-Adaboost classification model for impoverished students and training the classification model with weighted data, the specific steps being as follows:

(3.1) inputting the training data set T and initializing the weights of the training data as $D_1 = (w_{1,1}, \dots, w_{1,i}, \dots, w_{1,N})$, where $w_{1,i} = 1/N$, $i = 1, 2, \dots, N$, and N represents the amount of data in the student feature matrix S; at the same time, setting the iteration counter $m = 1$ and the total number of iterations to M, where $M = 10$;

(3.2) starting the iteration with a three-layer neural network, wherein the neural network is a BP neural network comprising an input layer, a hidden layer and an output layer, the input layer having 17 nodes, the hidden layer 18 nodes, and the output layer 3 nodes;

(3.3) training on the training data set with its weight distribution to obtain a weak classifier $G_m(x): X \to \{001, 010, 011\}$, where 001, 010 and 011 correspond to not impoverished, generally impoverished and especially impoverished, respectively;

(3.4) calculating the error rate $\mathrm{err}_m$ of the current classifier $G_m(x)$ on the training data:

(3.5) calculating the coefficient $\alpha_m$ of $G_m(x)$:

where K denotes the categories of poverty, with 1, 2 and 3 denoting not impoverished, generally impoverished and especially impoverished, and $\alpha_m$ represents the importance of $G_m(x)$ in the final classifier; $\alpha_m$ increases as $\mathrm{err}_m$ decreases, i.e. the smaller the classification error rate, the greater the contribution of the classifier to the final classifier;

(3.6) updating the weight distribution of the training data set:

$D_{m+1} = (w_{m+1,1}, \dots, w_{m+1,i}, \dots, w_{m+1,N})$,

where $w_{m+1,i}$ can be written as the following formula:

$Z_m$ is a normalization factor:

which makes $D_{m+1}$ a probability distribution;

(3.7) judging whether to terminate the iteration: when $m < M$, setting $m = m + 1$, returning to step (3.3) and continuing to iterate; otherwise, terminating the iteration, completing the training of the BP-Adaboost classifier, and obtaining the final classifier
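The training loop of steps (3.1) to (3.7) can be sketched end to end with NumPy. Everything beyond what the text states is an assumption: the sigmoid hidden layer, softmax output, learning rate and number of epochs are illustrative choices; the error rate, coefficient and weight update use the standard SAMME forms because the original formulas are not reproduced here; classes are indexed 0, 1, 2 (add 1 to obtain the outputs 1, 2, 3); and the data are synthetic.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_bp(X, y, w, hidden=18, classes=3, epochs=400, lr=0.2, seed=0):
    """Train one BP network by gradient descent on sample-weighted cross-entropy."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, classes));    b2 = np.zeros(classes)
    Y = np.eye(classes)[y]                           # one-hot targets
    for _ in range(epochs):
        H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))     # hidden layer (18 nodes)
        P = softmax(H @ W2 + b2)                     # output layer (3 nodes)
        dZ2 = (P - Y) * w[:, None]                   # weighted output error
        dH = (dZ2 @ W2.T) * H * (1.0 - H)            # error propagated back
        W2 -= lr * H.T @ dZ2; b2 -= lr * dZ2.sum(axis=0)
        W1 -= lr * X.T @ dH;  b1 -= lr * dH.sum(axis=0)
    def predict(Xq):
        Hq = 1.0 / (1.0 + np.exp(-(Xq @ W1 + b1)))
        return np.argmax(Hq @ W2 + b2, axis=1)
    return predict

def bp_adaboost(X, y, M=10, classes=3):
    """Boost M BP networks with SAMME-style weighting (assumed form)."""
    N = len(y)
    w = np.full(N, 1.0 / N)                          # (3.1) uniform initial weights
    learners, alphas = [], []
    for m in range(M):
        G_m = train_bp(X, y, w, classes=classes, seed=m)     # (3.2)-(3.3)
        miss = G_m(X) != y
        err = float(np.sum(w * miss))                        # (3.4)
        err = min(max(err, 1e-10), 1.0 - 1e-10)              # keep the log finite
        alpha = np.log((1.0 - err) / err) + np.log(classes - 1)  # (3.5)
        w = w * np.exp(alpha * miss)                         # (3.6)
        w /= w.sum()                                         # division by Z_m
        learners.append(G_m); alphas.append(alpha)
    def predict(Xq):
        votes = np.zeros((len(Xq), classes))
        for G_m, a in zip(learners, alphas):
            votes[np.arange(len(Xq)), G_m(Xq)] += a
        return votes.argmax(axis=1)
    return predict, w

# Synthetic stand-in for the 17-column feature matrix S and its labels.
rng = np.random.default_rng(42)
y = rng.integers(0, 3, size=150)
X = rng.normal(size=(150, 17))
X[np.arange(150), y] += 4.0                          # make the classes separable
predict, w_final = bp_adaboost(X, y)
accuracy = float(np.mean(predict(X) == y))
print("training accuracy:", accuracy)
```

The sketch does not stop early when a weak classifier is no better than chance; a fuller implementation would typically discard such a round or end the loop there.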

And (4): the method comprises the following steps of obtaining data of students to be identified, preprocessing the data of the students, inputting the preprocessed data into a classification model, and using a classification result for auxiliary identification of poverty-stricken students, wherein the specific steps are as follows:

(4.1) extracting the family situation and economic situation of the student to be identified, including whether the student is a solitary child, whether the student is an orphan, whether the family is a registered impoverished household, whether the student is disabled or ill, the degree of the student's disability or illness, whether a parent is disabled or ill, the degree of the parent's disability or illness, whether the student is a person in extreme hardship supported in urban or rural areas, and whether the family receives the urban or rural minimum living allowance; extracting campus consumption, including total consumption amount, maximum daily consumption amount, average daily consumption amount, maximum monthly consumption amount, and average number of daily transactions; extracting academic achievement, including grade point average, average semester score, and number of failed courses; extracting the basic situation as an impoverished student, including whether the student enrolled through the green channel and whether the student has taken out a student loan from the place of origin;

(4.2) preprocessing the acquired student data, the preprocessing comprising missing value processing, duplicate removal, feature coding and normalization, and constructing a student feature matrix S;

(4.3) inputting the student feature matrix S to be classified into the trained BP-Adaboost classification model to obtain the identification result: if the output is 1, the student is not impoverished; if the output is 2, the student is generally impoverished; and if the output is 3, the student is especially impoverished;

(4.4) checking the identification results of the classification model in practice, submitting the lists of students found to be suspected hidden impoverished students or falsely identified students to the university administrators for handling, and continuously adjusting the model according to the verified feedback;

While the foregoing shows and describes the principles and advantages of the present invention, the embodiments of the invention are not limited by the foregoing examples, which are given by way of illustration only; various changes and modifications within the spirit and scope of the invention will be apparent to those skilled in the art from this disclosure.

Claims

1. A method for identifying impoverished students based on multi-class BP-Adaboost, characterized by comprising the following steps:

2) let the student family and economic situation data set be $E = \{e_1, e_2, \dots, e_n\}$, where n denotes the student number and $e_n$ is a matrix composed of whether the student is a solitary child, whether the family is a registered impoverished household, whether a parent is disabled or ill, whether the student is a person in extreme hardship supported in urban or rural areas, and whether the family receives the urban or rural minimum living allowance;

4) normalization, namely normalizing the campus consumption data set C and the student achievement data set G by using a Sigmoid function to obtain the normalized campus consumption data set and the normalized student achievement data set;

classifying the students into three levels according to the poverty grades of previous years, namely not impoverished, generally impoverished and especially impoverished, and using one-hot coding as the poverty category label of each student to construct a training data set $T = \{(x_1, y_1), \dots, (x_i, y_i), \dots, (x_n, y_n)\}$, where the input data $x_i$ are randomly extracted from the student feature matrix S, the label $y_i \in \{001, 010, 011\}$, where 001, 010 and 011 correspond to not impoverished, generally impoverished and especially impoverished respectively, and n is the amount of data.

4) calculating the error rate $\mathrm{err}_m$ of the current classifier $G_m(x)$ on the training data:

wherein $y_i \in \{001, 010, 011\}$, where 001, 010 and 011 correspond to not impoverished, generally impoverished and especially impoverished respectively, and n is the amount of data;

5) calculating the coefficient $\alpha_m$ of $G_m(x)$:

where K denotes the categories of poverty and $\alpha_m$ represents the importance of $G_m(x)$ in the final classifier; $\alpha_m$ increases as $\mathrm{err}_m$ decreases, i.e. the smaller the classification error rate, the greater the contribution of the classifier to the final classifier;

6) updating the weight distribution of the training data set:

$D_{m+1} = (w_{m+1,1}, \dots, w_{m+1,i}, \dots, w_{m+1,N})$,

where $w_{m+1,i}$ can be written as the following formula:

$Z_m$ is a normalization factor:

which makes $D_{m+1}$ a probability distribution;

7) judging whether to terminate the iteration: when $m < M$, setting $m = m + 1$, returning to sub-step 3) and continuing to iterate; otherwise, terminating the iteration, completing the training of the BP-Adaboost classifier, and obtaining the final classifier

2. The method for identifying impoverished students based on multi-class BP-Adaboost as claimed in claim 1, wherein the campus consumption data set C and the student achievement data set G are normalized by using a Sigmoid function; the specific steps are as follows:

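each item of data in the campus consumption data set C is normalized with the Sigmoid function to obtain the normalized student campus consumption data, which together form the normalized campus consumption data set;

each item of data in the student achievement data set G is normalized with the Sigmoid function to obtain the normalized student achievement data, which together form the normalized student achievement data set.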