Detailed Description
The invention is described in further detail below with reference to the figures and preferred embodiments. The step numbers in the following embodiments are provided only for convenience of illustration; the order of the steps is not limited in any way, and the execution order of the steps in the embodiments may be adapted according to the understanding of those skilled in the art.
As shown in fig. 1, an embodiment of the present invention provides a method for classifying web pages, which includes the following steps:
S101, acquiring webpage link information;
specifically, in this embodiment, the web page is an advertisement page;
S102, inputting the acquired webpage link information into a text classification model for classification, and outputting a site classification result corresponding to the webpage link information;
the text classification model is trained based on a Boosting integration method.
According to the method, the constructed text classification model trained based on the Boosting integration method is brought online, and the advertisement page link information contained in bidding requests in real-time logs is classified with the text classification model, so that site classification is carried out in real time. In this way, advertisement page links of different site categories can be handed to crawlers of the corresponding categories for information crawling, which facilitates classified storage of webpage data as well as management and subsequent query processing; crawling can be skipped entirely for webpages of categories in which advertisement pages perform poorly, which avoids wasting resources, improves processing efficiency, greatly helps balance the load across the crawlers, makes reasonable use of bandwidth, and supports downstream algorithm tasks. In addition, because the method uses a text classification model trained based on the Boosting integration method to classify webpages, both the processing speed and the accuracy of the classification are high.
In a preferred embodiment, the web page link information includes a website title keyword list corresponding to the web page link, where the keyword list includes at least one keyword.
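Purely as an illustration of the shape of the data flowing through steps S101-S102, the following sketch uses a trivial stand-in classifier; the class name, field names and keyword rule are assumptions made for the example, not part of the original disclosure:

```python
from typing import List

class DummyTextClassificationModel:
    """Stand-in for the trained Boosting-based model built in step S100 below;
    a real implementation would load the integrated classifier H(x)."""
    def classify(self, title_keywords: List[str]) -> str:
        # Trivial keyword rule used purely so the example runs end to end.
        return "shopping" if "shoes" in title_keywords else "other"

# S101: webpage link information -- the site link plus the website title keyword list.
webpage_link_info = {
    "site": "http://example.com/ad-landing-page",
    "title_keywords": ["discount", "sports", "shoes"],
}

# S102: the keyword list is input to the text classification model, which outputs
# the site classification result for this advertisement page.
model = DummyTextClassificationModel()
print(model.classify(webpage_link_info["title_keywords"]))  # -> shopping
```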
In a preferred embodiment, the text classification model is a text classification model constructed by the following construction step S100:
S1001, acquiring a training data set.
Specifically, as for the step S1001, it preferably includes:
S10011, data acquisition step: selecting data stored in the last r days from a persistence layer database as an original data set;
the data feature components of the original data set are (site, title, manual_site_cat, predict_site_cat), which respectively denote the site link site, the site title title, the manually classified site category manual_site_cat, and the model-predicted site category predict_site_cat; that is, each piece of original data in the original data set includes four kinds of information, i.e., the site link, the site title, the manually classified site category, and the model-predicted site category;
S10012, data preprocessing step: removing original data in which site is null, title is null, or both predict_site_cat and manual_site_cat are null;
S10013, Chinese word segmentation step: sequentially performing Chinese word segmentation and stop-word removal on title to obtain a processed data set A, wherein the data feature components of the processed data set A are (cate: title_keywords), i.e., each piece of data (sample) in data set A includes the category cate of the site and the keyword list title_keywords (the site title keyword list) obtained after word segmentation and stop-word removal of the site title;
S10014, data set preparation step: dividing the processed data set A obtained in step S10013 into a training data set X and a test data set TE according to a preset ratio, for example 3:1;
specifically, 3/4 of the data in the processed data set A is used as the training data set X and 1/4 as the test data set TE; in the text classification method, title_keywords serves as input data and cate serves as output data, i.e., the title_keywords in the training data set X are used as training input data, the cate in the training data set X is used as training output data, and the text classification model is trained with the training input data and the training output data; the title_keywords in the test data set TE are used as test input data, the cate in the test data set TE is used as test output data, and the text classification model obtained after training is tested with the test input data and the test output data.
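For illustration only, the data acquisition, preprocessing, word segmentation and data set division of steps S10011-S10014 could be sketched as follows, assuming a Python implementation that uses the jieba library for Chinese word segmentation; the stop-word list and the helper function names are assumptions made for the example:

```python
import random
import jieba  # assumed third-party library for Chinese word segmentation

STOP_WORDS = {"的", "了", "和", "是"}  # illustrative stop-word list only

def preprocess(raw_records):
    """S10012: drop records whose site or title is null, or whose
    predict_site_cat and manual_site_cat are both null."""
    kept = []
    for r in raw_records:
        if not r.get("site") or not r.get("title"):
            continue
        if not r.get("predict_site_cat") and not r.get("manual_site_cat"):
            continue
        kept.append(r)
    return kept

def segment(records):
    """S10013: segment the title and remove stop words, yielding (cate, title_keywords)."""
    dataset_a = []
    for r in records:
        keywords = [w for w in jieba.cut(r["title"]) if w.strip() and w not in STOP_WORDS]
        cate = r.get("manual_site_cat") or r.get("predict_site_cat")
        dataset_a.append({"cate": cate, "title_keywords": keywords})
    return dataset_a

def split(dataset_a, ratio=0.75, seed=0):
    """S10014: divide data set A into training set X and test set TE at a 3:1 ratio."""
    random.Random(seed).shuffle(dataset_a)
    cut = int(len(dataset_a) * ratio)
    return dataset_a[:cut], dataset_a[cut:]
```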
S1002, inputting the training data set X into a text classification model by using a Boosting integration method for training, so as to obtain a trained text classification model H(x). As shown in fig. 2, which is a schematic diagram of the text classification model used in this embodiment, the text classification model is obtained by iteratively training a plurality of basic classifiers H1, H2, ..., Hn and then integrating the basic classifiers obtained after each round of iterative training into one classification model; n is the total number of basic classifiers, i.e., the number of basic classifiers adopted in the text classification model.
Specifically, as for the step S1002, it preferably includes:
S10021, training a current basic classifier by using a current sub-sample set, and calculating the error rate of the basic classifier obtained after training;
the sub-sample set is obtained by assigning corresponding weights to the samples contained in the training data set X; preferably and specifically, after corresponding weights are assigned to the samples contained in the training data set X, N1 samples are randomly extracted from the training data set X as a sub-sample set, wherein each sample has its corresponding weight; the initial weight of a sample is 1/m, i.e., in the first iteration the weight assigned to the samples in X is 1/m, and at this time the N1 samples randomly extracted from X are taken as the first sub-sample set; wherein m is the total number of samples contained in the training data set X;
specifically, in the training data set X, each sample has a corresponding weight, and the set of weight vectors is denoted D; specifically, D1, D2, ..., DT are respectively the weight vector sets of the 1st to T-th rounds (t = 1, 2, 3, ..., T), for example, D1 contains the weights of the samples in the 1st round of iteration;
in the 1st iteration, there are m samples in the training data set X; after the weight of each sample is initialized to 1/m, a first sub-sample set S1 is obtained by random extraction, i.e., at this time, in the 1st iteration, the i-th sample xi in S1 has the corresponding weight D1(xi) = 1/m;
Then, the first sub-sample set S1 is input to a first basic classifier H1 for training, i.e., the first basic classifier H1 is trained with the first sub-sample set S1, and after one round of training, the error rate of the first basic classifier H1 obtained after the training is calculated; if the current iteration is the t-th iteration, the current t-th sub-sample set St is input to the current j-th (j = 1, 2, 3, ..., n) basic classifier Hj for training, i.e., the current j-th basic classifier Hj is trained with the current t-th sub-sample set St, and after one round of training, the error rate of the j-th basic classifier Hj obtained after the training is calculated;
since the number of basic classifiers included in the text classification model is n, and the number of iteration rounds T is usually greater than n, when the (n+1)-th round of iterative training is performed, training returns to the 1st basic classifier; in this case, the basic classifier trained in the (n+1)-th round is again one of the n basic classifiers, and in general the basic classifier Hj trained in the t-th round of iteration is referred to as the basic classifier of the t-th round;
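A small illustration of the correspondence between iteration rounds and basic classifiers described above, under the assumption that the n basic classifiers are reused cyclically once the round number exceeds n:

```python
# Assumed cyclic reuse of the n basic classifiers over T iteration rounds:
# round n+1 returns to the 1st basic classifier, and so on.
n, T = 3, 7
for t in range(1, T + 1):
    j = (t - 1) % n + 1   # index j of the basic classifier Hj trained in round t
    print(f"round t={t}: basic classifier H{j}")
```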
in a preferred embodiment, the error rate is calculated as follows:
ε(t) = Σ_{p=1}^{N} Dt(xp)
in the formula, ε(t) is the error rate corresponding to the basic classifier of the t-th round after training; if t = n+1, then ε(t) is the error rate corresponding to the basic classifier of the (n+1)-th round after training;
Dt(xp) is the weight of the p-th sample xp in the sub-sample set whose classification result is wrong, i.e., in the training of the t-th iteration, the weight of the p-th misclassified sample xp in the current sub-sample set; wherein p ∈ [1, 2, ..., N], and N is the number of samples in the sub-sample set whose classification results are wrong;
t ∈ [1, 2, ..., T], and T is the number of iteration rounds;
ht is the basic classifier of the t-th round;
ht(xp) is a classification result prediction value, specifically, the classification result prediction value output by ht when sample xp is used in the training;
yp is the true value of the classification result;
[ht(xp) ≠ yp] = −1, i.e., the bracket expression takes the value −1 when the predicted value differs from the true value;
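As an illustration only, the error rate of step S10021 could be computed as the total weight of the misclassified sub-sample samples, consistent with the definitions above; the function name and data layout below are assumptions:

```python
def error_rate(sub_sample_weights, predicted_cates, true_cates):
    """Hedged sketch: the error rate epsilon(t) of the t-th round is taken here as the sum
    of the current weights D_t(x_p) of the sub-sample samples whose predicted category
    differs from the true category, consistent with the definitions above."""
    return sum(w for w, pred, true in zip(sub_sample_weights, predicted_cates, true_cates)
               if pred != true)

# Example: the misclassified samples carry weights 0.01, 0.02 and 0.03, so epsilon(t) ~= 0.06.
print(error_rate([0.01, 0.25, 0.02, 0.03],
                 ["games", "shopping", "games", "news"],
                 ["news", "shopping", "news", "games"]))
```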
S10022, when the calculated error rate converges to within the threshold range, ending the training and executing step S10026;
S10023, when the calculated error rate does not converge to within the threshold range, executing step S10024;
S10024, updating the weights of the samples contained in the current sub-sample set according to the calculated error rate, so as to increase the weights of the samples whose classification results are wrong;
specifically, the weight of each sample contained in the current sub-sample set is updated according to the calculated error rate, and the weights of the samples whose classification results are wrong are increased;
in a preferred embodiment, the weights of the samples contained in the current sub-sample set are updated according to the following weight update formula:
Dt+1(xi) = (Dt(xi) / Zt) · exp(−αt · [ht(xi) = yi])
wherein αt is the weight corresponding to the basic classifier of the t-th round after training;
k is the number of categories (i.e., the number of classification categories) output by the text classification model; for example, if the text classification model can classify 3 categories, then k = 3;
Zt is a normalization factor, which makes the updated weights of the samples form a distribution;
Dt(xi) is the weight of the i-th sample xi in the sub-sample set, i.e., the weight of the i-th sample xi in the current sub-sample set during the training of the t-th iteration; wherein i ∈ [1, 2, ..., N1], and N1 is the total number of samples in the sub-sample set;
Dt+1(xi) is the weight of the i-th sample xi in the sub-sample set in the (t+1)-th iteration, i.e., the updated weight of the i-th sample xi in the sub-sample set;
ht(xi) is a classification result prediction value, specifically, the classification result prediction value output by ht when sample xi is used in the training;
yi is the true value of the classification result;
it can be seen from the above formula that, for a sample whose classification result is correct, the corresponding weight in the next iteration, i.e., the (t+1)-th iteration, is Dt+1(xi) = (Dt(xi) / Zt) · exp(−αt), because in that case [ht(xi) = yi] = 1; for a sample whose classification result is wrong, the corresponding weight in the next iteration, i.e., the (t+1)-th iteration, is Dt+1(xi) = (Dt(xi) / Zt) · exp(αt), because in that case [ht(xi) ≠ yi] = −1, i.e., the exponent term takes the value −1;
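As an illustration only, the weight update of step S10024 might be implemented as below. The classifier_weight() expression for αt is an assumption (a SAMME-style formula involving the error rate and the number of categories k), since the text above names αt without reproducing its expression:

```python
import math

def classifier_weight(error_rate, k):
    """Assumed SAMME-style round weight alpha_t derived from the error rate and the
    number of categories k; the exact formula used by the invention is not reproduced
    in the text above."""
    return math.log((1 - error_rate) / error_rate) + math.log(k - 1)

def update_weights(weights, correct_flags, alpha):
    """Hedged sketch of the weight update of step S10024: each sub-sample weight is
    multiplied by exp(-alpha) when the sample was classified correctly and by exp(+alpha)
    when it was misclassified, then divided by the normalization factor Z_t."""
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct_flags)]
    z_t = sum(raw)                      # normalization factor Z_t
    return [w / z_t for w in raw]

alpha_t = classifier_weight(error_rate=0.2, k=3)   # ~2.08 under the assumed formula
# The misclassified sample (ok=False) ends up with a larger share of the total weight.
print(update_weights([0.25, 0.25, 0.25, 0.25], [True, True, False, True], alpha_t))
```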
S10025, after the updated weights are distributed to the samples in the training data set, obtaining the next sub-sample set, inputting the next sub-sample set into the next basic classifier, and returning to execute step S10021;
specifically, the updated sample weights are assigned to the corresponding samples in the training data set to replace the current weights of those samples in X; after the update, N1 samples are randomly extracted from X again to obtain the next sub-sample set, the next sub-sample set is input to the next basic classifier, and the next basic classifier is trained with the next sub-sample set, thereby realizing the next round of iterative training;
S10026, integrating the plurality of basic classifiers to obtain the trained text classification model;
specifically, after T rounds of iteration, when the error rate converges to within the threshold range, the model training process ends; at this time, the multiple rounds of iterative training of the n basic classifiers are completed, and the basic classifiers obtained in each round of iterative training, i.e., the basic classifiers of the 1st, 2nd, 3rd, ..., T-th rounds, are integrated to obtain the finally required text classification model; the basic classifiers obtained in each round of iterative training are integrated into the finally required text classification model H(x) by the following integration formula:
H(x) = argmax_{y∈Y} Σ_{t: ht(x) = y} αt
i.e., for an input x, the final model outputs the category y for which the sum of the weights αt of the basic classifiers predicting y is the largest; wherein y ∈ Y indicates that the prediction result belongs to the label set Y.
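As an illustration only, the integration of step S10026 could be realized as a weighted vote over the basic classifiers of all rounds; the function name is an assumption, and plain functions stand in for the trained basic classifiers:

```python
from collections import defaultdict

def integrated_predict(round_classifiers, round_weights, sample):
    """Hedged sketch of the integration of step S10026: the final model H(x) outputs the
    category y in the label set Y whose weighted vote, summed over the basic classifiers
    of all rounds with their weights alpha_t, is the largest."""
    votes = defaultdict(float)
    for h_t, alpha_t in zip(round_classifiers, round_weights):
        votes[h_t(sample)] += alpha_t
    return max(votes, key=votes.get)

# Toy usage: plain functions stand in for the trained basic classifiers of three rounds,
# with round weights 0.5, 1.2 and 0.9; "shopping" wins the weighted vote (0.5 + 0.9 > 1.2).
h1 = lambda s: "shopping"
h2 = lambda s: "games"
h3 = lambda s: "shopping"
print(integrated_predict([h1, h2, h3], [0.5, 1.2, 0.9], ["discount", "shoes"]))
```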
The n basic classifiers used in the text classification model may be the same classifier or different classifiers. To further improve the classification accuracy, it is preferable to select, as the basic classifiers of the text classification model of this embodiment, basic classifiers whose error rate falls within the range [0, 1/k], where k is the number of categories (i.e., the number of classification categories) output by the text classification model. For example, the classification effect of the two basic classifiers SVM and TextGrocery (i.e., their error rate ranges) satisfies the condition of falling within [0, 1/k], so the SVM and/or TextGrocery can be selected as basic classifiers. Experimental results show that the integrated classifier obtained by integrating SVM and/or TextGrocery weak classifiers achieves a better classification effect and higher accuracy than the SVM or TextGrocery alone, which indicates that the classification effect of the integrated classifier is better than that of a single basic classifier. Therefore, on the basis of a single TextGrocery, the classification scheme of the invention can improve the accuracy of text classification by means of the Boosting integration method.
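For reference, a minimal sketch of using an SVM as one of the basic classifiers, assuming scikit-learn is available and representing each sample by a bag-of-words vector built from its title keyword list; TextGrocery (a TF-IDF plus linear-SVM wrapper) could be plugged in analogously, though its API is not reproduced here:

```python
# Assumed scikit-learn-based SVM basic classifier; the training data are made-up examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

titles = [["折扣", "运动", "鞋"], ["手游", "下载"], ["促销", "球鞋"]]
cates  = ["shopping", "games", "shopping"]

vec = CountVectorizer(analyzer=lambda kws: kws)       # keywords are already tokenized
X_mat = vec.fit_transform(titles)

svm = LinearSVC()
svm.fit(X_mat, cates, sample_weight=[1/3, 1/3, 1/3])  # per-sample weights from the Boosting round
print(svm.predict(vec.transform([["球鞋", "促销"]])))   # expected: ["shopping"]
```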
In a preferred embodiment, the constructing step S100 further includes:
S1003, performing ten-fold cross validation on the text classification model H(x) trained in step S1002 by using the training data set X;
specifically, after the training data set X is shuffled and divided into ten parts, ten-fold cross validation is performed on the text classification model H(x) by using the ten parts of data, so as to verify the accuracy and stability of the model;
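As an illustration only, the ten-fold cross validation of step S1003 could be organized as follows, assuming a train_model() callable that runs steps S10021-S10026 on a given subset and an accuracy() helper; both names are assumptions:

```python
from statistics import mean, pstdev
from sklearn.model_selection import KFold

def ten_fold_cross_validate(samples, labels, train_model, accuracy):
    """Hedged sketch of step S1003: split the (shuffled) training data set X into ten
    folds and, for each fold, train on the other nine and score on the held-out fold."""
    scores = []
    for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(samples):
        model = train_model([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        scores.append(accuracy(model,
                               [samples[i] for i in val_idx],
                               [labels[i] for i in val_idx]))
    # The mean score reflects the accuracy of the model, the spread reflects its stability.
    return mean(scores), pstdev(scores)
```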
S1004, when the model H(x) passes the verification, testing the verified text classification model by using the test data set TE;
specifically, the test data set TE is input into the text classification model after passing the verification, and the accuracy, the recall rate, and F1-score of the text classification model are calculated and compared with the accuracy, the recall rate, and F1-score of the plurality of basic classifiers described in the above step S10026, so as to test the accuracy of the model and ensure the accuracy of the model classification.
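As an illustration only, the test of step S1004 could compute the accuracy, recall rate and F1-score with scikit-learn as follows (the labels shown are made-up examples):

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = ["shopping", "games", "shopping", "news"]   # cate labels of the test set TE
y_pred = ["shopping", "games", "news", "news"]       # model outputs for the title_keywords

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1  = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.2f} recall={rec:.2f} F1-score={f1:.2f}")
```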
The webpage classification scheme can be applied to an advertisement bidding system for site classification of advertisement pages, can also be applied to other systems for site classification of webpages in other fields (such as games, shopping and the like), and has wide application range and high compatibility.
As shown in fig. 3, an embodiment of the present invention further provides a web page classification system, which includes:
an obtaining module 201, configured to obtain webpage link information;
the processing module 202 is configured to input the obtained webpage link information into a text classification model for classification processing, and output a site classification result corresponding to the webpage link information;
the text classification model is trained based on a Boosting integration method.
In a preferred embodiment, the system further comprises a construction module for constructing the text classification model.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
As shown in fig. 4, an embodiment of the present invention further provides a terminal, including:
at least one processor 301;
at least one memory 302 for storing at least one program;
when executed by the at least one processor 301, the at least one program causes the at least one processor 301 to implement the steps of the method for classifying web pages described in the above method embodiments.
The contents in the foregoing method embodiments are all applicable to this terminal embodiment, the functions specifically implemented by this terminal embodiment are the same as those in the foregoing method embodiments, and the beneficial effects achieved by this terminal embodiment are also the same as those achieved by the foregoing method embodiments.
Embodiments of the present invention further provide a storage medium having stored therein processor-executable instructions, which when executed by a processor, are configured to perform the steps of a method for classifying web pages as described in the above method embodiments.
The contents in the above method embodiments are all applicable to the present storage medium embodiment, the functions specifically implemented by the present storage medium embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present storage medium embodiment are also the same as those achieved by the above method embodiments.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.