CN116821818A - Form data classification method and device, equipment and storage medium - Google Patents


Info

Publication number
CN116821818A
Authority
CN
China
Prior art keywords
parameter
classification
form data
determining
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310720362.7A
Other languages
Chinese (zh)
Inventor
黄慧 (Huang Hui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202310720362.7A
Publication of CN116821818A
Legal status: Pending

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02P: Climate change mitigation technologies in the production or processing of goods
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application disclose a form data classification method, apparatus, device, and storage medium. The method includes: obtaining form data from target information; determining a classification result of the target information through a target model based on first characteristic information of the form data; and adjusting a first parameter and a second parameter to change the classification tendency of the target model and thereby adjust the classification result, where the first parameter and the second parameter are related to sample loss.

Description

Form data classification method and device, equipment and storage medium
Technical Field
The embodiment of the application relates to a machine learning technology, in particular to a form data classification method, a form data classification device, form data classification equipment and a storage medium.
Background
At present, a model classifying multiple categories of features generally adjusts its classification results uniformly and cannot be tuned precisely for individual categories.
Disclosure of Invention
In view of this, embodiments of the present application provide a form data classification method, apparatus, device, and storage medium.
The technical scheme of the embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a form data classification method, where the method includes:
form data are obtained from the target information;
determining a classification result of the target information through a target model based on the first characteristic information of the form data;
the first parameter and the second parameter are adjusted, and the classification tendency of the target model is changed so as to adjust the classification result; the first parameter and the second parameter are related to sample loss.
In some embodiments, the method further comprises: determining the classification model after the t-th iteration of training from the classification model after the (t−1)-th iteration combined with the first module of the t-th iteration; the target model is the classification model from the final iteration, and t is a positive integer greater than or equal to 2.
In some embodiments, the method further comprises: determining a second module according to the predicted probability value of the positive sample; determining a third module according to the linear coefficient relation of the second module; determining the first module according to the linear coefficient relation of the third module; wherein the third module includes the first parameter and the second parameter.
In some embodiments, the first and second parameters being related to sample loss comprises: the first parameter is a loss proportion parameter between positive and negative samples; the second parameter is a loss contribution parameter of the easily classified samples.
In some embodiments, the method further comprises: preprocessing the original characteristic information of the target information to determine second characteristic information; wherein each piece of original feature information comprises at least one feature field, and the preprocessing comprises at least one of removing unnecessary feature fields, combining similar feature fields and supplementing missing feature fields by an algorithm.
In some embodiments, the method further comprises: determining a correlation matrix according to the second characteristic information; according to the correlation matrix, determining the correlation between the characteristic fields; and removing the characteristic field with small correlation from the second characteristic information, and determining the first characteristic information.
In a second aspect, an embodiment of the present application provides a form data classification apparatus, including:
an acquisition unit for acquiring form data from the target information;
the determining unit is used for determining a classification result of the target information through a target model based on the first characteristic information of the form data;
an adjusting unit for adjusting the first parameter and the second parameter to change the classification tendency of the target model so as to adjust the classification result; the first parameter and the second parameter are related to sample loss.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory stores a computer program executable on the processor, and where the processor implements steps in the above method when the program is executed.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs steps in the above method.
Drawings
FIG. 1 is a schematic diagram of an implementation flow of a form data classification method according to an embodiment of the present application;
FIG. 2 is a second schematic diagram of an implementation flow of a form data classification method according to an embodiment of the present application;
fig. 3A is a schematic diagram III of an implementation flow chart of a form data classification method according to an embodiment of the present application;
FIG. 3B is a schematic diagram of a correlation matrix according to an embodiment of the present application;
FIG. 3C is a diagram illustrating the correlation between bad orders and feature variables according to an embodiment of the present application;
FIG. 3D is a schematic diagram of the result of chi-square verification according to an embodiment of the present application;
FIG. 3E is a schematic diagram of a feature importance map according to an embodiment of the present application;
FIG. 3F is a schematic diagram of a confusion matrix after model classification by focus loss function training according to an embodiment of the present application;
fig. 4 is a schematic diagram of the composition structure of a form data classification apparatus according to an embodiment of the present application;
fig. 5 is a schematic diagram of a hardware entity of an electronic device according to an embodiment of the present application.
Detailed Description
The technical scheme of the application is further elaborated below with reference to the drawings and examples. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, suffixes such as "module", "component", or "unit" for representing elements are used only for facilitating the description of the present application, and have no specific meaning per se. Thus, "module," "component," or "unit" may be used in combination.
It should be noted that the terms "first", "second", and "third" in the embodiments of the present application merely distinguish similar objects and do not denote a specific order; it should be understood that "first", "second", and "third" may be interchanged in a specific order or sequence where allowed, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
Currently, the after-sales service order identification process may be: after a customer submits an after-sales service order, in most cases after-sales support personnel must check, process, and update the service orders one by one manually. The disadvantages of this procedure are: it relies heavily on manual support, and if feedback is not timely, the optimal delivery time of the order may be missed, affecting the customer experience. Therefore, from the standpoint of service order management and service response speed, it is necessary to introduce an accurate automatic identification and optimization method into after-sales service.
Based on the above, the embodiment of the application provides a form data classification method that changes the classification tendency of a target model by adjusting a first parameter and a second parameter, thereby adjusting the model's classification result for the target information. When the overall accuracy of the model is difficult to improve further, adjusting these parameters shifts the classification tendency and rebalances the accuracy across categories, so that the classification accuracy of one or more categories of interest is further improved. For example, for cancelled orders, adjusting the first parameter and the second parameter balances the accuracy of the two classification results ("cancelled order" and "non-cancelled order"), further improving the accuracy on cancelled orders and avoiding the excessive subsequent labor cost caused by falsely reported orders. The functions performed by the method may be implemented by a processor in an electronic device invoking program code, which may of course be stored in a storage medium of the electronic device.
Here, the electronic device may be various types of devices having information processing capability, such as a smart phone, a navigator, a tablet computer, a wearable device, a laptop portable computer, a floor sweeping robot, a smart kitchen, a smart home, an automobile, a server or a server cluster, and the like.
Fig. 1 is a schematic diagram of an implementation flow chart of a form data classification method according to an embodiment of the present application, as shown in fig. 1, where the method includes:
step S101, form data are obtained from target information;
in the embodiment of the application, the target information can be various types of information containing form data. For example, the target information may be a pre-sales order or an after-sales order for the merchandise, and for another example, the target information may be a registration order for a hospital patient, or the like. Taking the target information as an example of the after-sales order, the form data includes, but is not limited to: user name, product name, order information (including time, etc.), product warranty items, etc.
Step S102, determining a classification result of the target information through a target model based on the first characteristic information of the form data;
the first parameter and the second parameter are adjusted, and the classification tendency of the target model is changed so as to adjust the classification result; the first parameter and the second parameter are related to sample loss.
Here, the trained target model determines the classification result of the target information based on the first characteristic information of the form data. The classification tendency of the target model changes with the first parameter and the second parameter, which are related to the sample loss during training, and this tendency in turn adjusts the classification result of the target information.
That is, the embodiment of the present application obtains the classification result of the target information from the first characteristic information of the form data and the trained target model, and changes the classification tendency of the target model by changing the first and second parameters related to sample loss. In one embodiment, the target information is an after-sales order, and the trained target model classifies the order as either "cancelled" or "received". For some orders, the features do not strongly indicate either outcome, so the classification tendency can be shifted by adjusting the first and second parameters associated with sample loss, thereby changing the classification result.
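To make the notion of "classification tendency" concrete, here is a small self-contained toy (an illustration under stated assumptions, not the patent's implementation): with a cross-entropy loss weighted by a positive-class weight `alpha`, raising `alpha` pulls the loss-minimizing constant prediction toward the positive class, i.e. it shifts the model's tendency toward predicting positives.

```python
import math

def weighted_ce(p, labels, alpha):
    """Mean cross-entropy with weight alpha on positives, (1 - alpha) on negatives."""
    total = 0.0
    for y in labels:
        if y == 1:
            total += -alpha * math.log(p)
        else:
            total += -(1 - alpha) * math.log(1 - p)
    return total / len(labels)

def best_constant_prediction(labels, alpha, grid=999):
    """Grid-search the constant probability that minimizes the weighted loss."""
    candidates = [(i + 1) / (grid + 1) for i in range(grid)]
    return min(candidates, key=lambda p: weighted_ce(p, labels, alpha))

labels = [1] * 2 + [0] * 8          # imbalanced set: 2 positives, 8 negatives
p_low  = best_constant_prediction(labels, alpha=0.25)
p_high = best_constant_prediction(labels, alpha=0.75)
print(p_low, p_high)                # the optimum shifts upward as alpha grows
```

The analytic optimum is alpha·n_pos / (alpha·n_pos + (1 − alpha)·n_neg), so the grid search lands near 0.077 for alpha = 0.25 and near 0.429 for alpha = 0.75: the same data, classified with a different tendency.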
In the embodiment of the application, by the classification method in the steps S101 to S102, the classification tendency of the target model can be changed based on sample loss, and the classification result can be flexibly adjusted according to actual requirements, so that the classification accuracy is improved.
In some embodiments, the first and second parameters are related to sample loss, comprising:
the first parameter is a loss proportion parameter between positive and negative samples;
the second parameter is a loss contribution parameter in the frangible sample.
Here, positive and negative samples refer to the samples in the training process. The training sample set is sometimes imbalanced; long-tailed data, for example, has a severe positive/negative imbalance, with many samples in one class and few in the other. Both positive and negative samples incur a sample loss (determined from the predicted value and the true label of the sample), and the first parameter is the loss proportion parameter between positive and negative samples, which suppresses the imbalance in their numbers. Besides class imbalance, a training set may also contain easily classified samples and hard-to-classify samples; the second parameter is the loss contribution parameter of the easily classified samples, which controls the imbalance between easy and hard samples.
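As an illustrative sketch (not the patent's code), the focal loss and the role of its two parameters can be written out directly: α_t reweights positive versus negative losses, and γ shrinks the loss of easy samples far more aggressively than plain cross-entropy does.

```python
import math

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t),
    where p_t is the predicted probability of the true class."""
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

# An easy sample (p_t near 1) is down-weighted by the modulating factor
# (1 - p_t)**gamma; a hard sample (p_t near 0) keeps most of its loss.
easy = focal_loss(0.9)
hard = focal_loss(0.1)
ce_easy = -math.log(0.9)   # plain cross-entropy, for comparison
ce_hard = -math.log(0.1)
print(easy, hard)          # hard/easy ratio far exceeds ce_hard/ce_easy
```

The hard-to-easy loss ratio under the focal loss is roughly two orders of magnitude larger than under cross-entropy, which is exactly the "reduce the loss contribution of easy samples" behavior described above.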
Based on the foregoing embodiment, the form data classification method provided by the embodiment of the present application further includes: step S111, determining the classification model after the t-th iteration of training from the classification model after the (t−1)-th iteration, combined with the first module of the t-th iteration; the target model is the classification model from the final iteration, and t is a positive integer greater than or equal to 2.
Here, step S111 is performed at least before step S102. The execution order of the step S101 and the step S111 is not limited in the embodiment of the present application, and the step S101 may be executed first, then the step S111 may be executed, or the step S111 may be executed first, then the step S101 may be executed. In general, step S111 is executed (i.e., the target model is trained first to obtain a trained target model), then form data to be predicted in the target information is obtained, and a classification result is obtained through the trained target model.
In the embodiment of the present application, the target model may be any ensemble learning model, such as a CatBoost, XGBoost, or AdaBoost model. Thus, the classification model F_{t-1} obtained after the (t−1)-th iteration of training is combined with the first module of the t-th iteration to determine the classification model F_t obtained after the t-th iteration. It should be noted that the classification model obtained after the 1st iteration is determined from the initial classification model combined with the first module of the 1st iteration, where the initial classification model is a model that has not undergone any iteration of training. The embodiment of the application trains the classification model by serial iteration. The t-th iteration of training proceeds as follows:
step S11, a training sample set is obtained, wherein the training sample set comprises a positive sample and a negative sample, the positive sample carries a positive label, and the negative sample carries a negative label;
step S12, based on the first characteristic information of each sample in the training sample set, determining a sample classification result set of the training sample set through the classification model F_{t-1};
step S13, through the first module, based on the error between the sample classification result set and the labels carried by each sample in the training sample set, updating the model parameters of the classification model F_{t-1} to obtain the parameter-updated classification model F_t.
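The serial loop of steps S11 to S13 can be sketched as follows. This is an illustrative toy, not the patent's implementation: squared loss and a hand-rolled one-feature stump stand in for the classification loss and the first module, but the round structure F_t = F_{t-1} + h_t is the same.

```python
def fit_stump(xs, residuals):
    """Pick the threshold split that best fits the residuals (squared loss)."""
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda x: lmean if x <= thr else rmean

def boost(xs, ys, rounds=10, lr=0.5):
    preds = [0.0] * len(xs)                                # F_0 = 0
    for _ in range(rounds):                                # round t
        residuals = [y - p for y, p in zip(ys, preds)]     # negative gradient
        h = fit_stump(xs, residuals)                       # fit weak learner h_t
        preds = [p + lr * h(x) for p, x in zip(preds, xs)] # F_t = F_{t-1} + h_t
    return preds

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
preds = boost(xs, ys)
print(preds)   # predictions converge toward the 0/1 labels round by round
```

Each round fits only the error left over from the previous model, which is the "serial iteration" training described above.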
In the embodiment of the application, the objective function h_t of the t-th iteration is shown in formula (1):
h_t = argmin_{h∈H} E[L(y, F_{t-1}(x) + h(x))]   (1)
where argmin_{h∈H} selects the h that minimizes E[L(y, F_{t-1}(x) + h(x))], E[·] denotes the expectation, L(·,·) is the loss function, y is the label of the sample, F_{t-1}(x) is the classification result of the current model for sample x, and h(x) is the candidate weak learner fitted in this round.
The approximation of each round's loss is fitted using the negative gradient of the loss function; the gradient expression is shown in formula (2):
g_t(x, y) = ∂L(y, F_{t-1}(x)) / ∂F_{t-1}(x)   (2)
Substituting this into the objective function yields the approximately fitted objective shown in formula (3):
h_t = argmin_{h∈H} E[(−g_t(x, y) − h(x))²]   (3)
The final classification model is shown in formula (4):
F_t(x) = F_{t-1}(x) + h_t(x)   (4)
According to the embodiment of the application, a focal loss function (Focal Loss) is introduced into the classification model to train the model efficiently; the resulting classification loss is shown in formula (5):
FL(p_t) = −α_t (1 − p_t)^γ log(p_t)   (5)
where p_t is the predicted probability that sample x belongs to the positive class, (1 − p_t)^γ is the modulating factor, α_t is the loss proportion parameter between positive and negative samples (i.e., the first parameter), used to adjust the ratio between positive- and negative-sample losses, and γ is the loss contribution parameter of the easily classified samples (i.e., the second parameter), used to reduce their loss contribution; both α_t and γ are adjustable parameters.
When p_t tends to 1, i.e., the sample is easy to classify, the modulating factor (1 − p_t)^γ tends to 0, indicating a small contribution to the loss; that is, the loss share of easily classified samples is reduced. When p_t is small, i.e., the sample is misclassified as belonging to the positive class, the modulating factor (1 − p_t)^γ tends to 1, and its loss is left nearly unaffected.
Specifically, for the above focus loss function, i.e., formula (5), the first derivative is calculated to obtain a gradient expression as shown in formula (6):
and deriving the first derivative to obtain the second derivative of the focus loss function as shown in a formula (7):
the classification model obtained after the introduction of the focus loss function is shown in formula (8):
F t (x)=F t-1 (x)+H t (8)
In some embodiments, the first module of the t-th iteration of training may be the second derivative H_t of the focal loss function. Further, the embodiment of the application uses the first module H_t to replace the objective function h_t in the ensemble learning model, and changes the degree of classification tendency by adjusting the first and second parameters related to sample loss. This mitigates inaccurate sample classification results caused by obvious positive/negative sample imbalance in the training sample set, achieves a more flexible and accurate classification effect, and improves the accuracy of the trained target model.
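As a sanity check on the derivative forms of the focal loss, the sketch below compares closed-form first and second derivatives (taken with respect to p_t, an assumption made here, since gradient-boosting libraries may instead differentiate with respect to the raw score) against central finite differences:

```python
import math

def fl(p, a=0.25, g=2.0):
    """Focal loss FL(p) = -a * (1 - p)**g * log(p)."""
    return -a * (1 - p) ** g * math.log(p)

def fl_grad(p, a=0.25, g=2.0):
    """First derivative of the focal loss with respect to p."""
    return a * (1 - p) ** (g - 1) * (g * math.log(p) - (1 - p) / p)

def fl_hess(p, a=0.25, g=2.0):
    """Second derivative of the focal loss with respect to p."""
    return a * (-(g - 1) * (1 - p) ** (g - 2)
                * (g * math.log(p) - (1 - p) / p)
                + (1 - p) ** (g - 1) * (g / p + 1 / p ** 2))

# Cross-check the closed forms against central finite differences.
eps = 1e-6
for p in (0.2, 0.5, 0.8):
    num_g = (fl(p + eps) - fl(p - eps)) / (2 * eps)
    num_h = (fl_grad(p + eps) - fl_grad(p - eps)) / (2 * eps)
    assert abs(num_g - fl_grad(p)) < 1e-4
    assert abs(num_h - fl_hess(p)) < 1e-4
print("derivatives consistent")
```

A gradient/hessian pair like this is exactly what a custom-objective interface of a boosting library consumes, which is the role the first module H_t plays above.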
Based on the foregoing embodiment, the form data classification method provided by the embodiment of the present application further includes:
step S121, determining a second module according to the predicted probability value of the positive sample;
here, the target model needs to be used in the training process to a training sample set including a plurality of training samples, that is, a positive sample and a negative sample. In one embodiment, the training sample set may include a plurality of after-market service orders, wherein the orders that need to be processed are positive samples and the orders that need to be cancelled are negative samples.
In the embodiment of the application, the second module can be determined from the predicted probability value of the positive sample. In one embodiment, the second module may be the focal loss function FL(p_t) shown in formula (5) above, determined from the predicted probability value of the positive sample and used in training the target model.
Step S122, determining a third module according to the linear coefficient relation of the second module;
Here, a linear coefficient relationship means that a first-order functional relationship exists between two variables. The linear coefficient relationship includes, but is not limited to: a derivative coefficient relationship; an addition, subtraction, multiplication, or division coefficient relationship; a differential coefficient relationship; and the like. In an embodiment, the second module may be differentiated to obtain a derivative result, and the third module is then determined from that result. The third module may be, for example, the first derivative G_t of the focal loss function shown in formula (6) above.
Step S123, determining a first module according to the linear coefficient relation of the third module; wherein the third module includes a first parameter and a second parameter;
In an embodiment, the third module may be differentiated to obtain a derivative result, and the first module is then determined from that result. The first module may be, for example, the second derivative H_t of the focal loss function shown in formula (7) above.
It should be noted that, in the embodiment of the present application, the second module includes at least the first parameter related to sample loss (i.e., α_t in the formula) and the second parameter (i.e., γ in the formula); accordingly, the third module and the first module each include at least these two parameters, and the classification tendency is changed by adjusting them. The embodiment of the application determines the third module from the predicted probability value of the positive sample and the linear coefficient relation of the second module, and further determines the first module from the linear coefficient relation of the third module, so that the first module can adjust the proportion between positive- and negative-sample losses and reduce the loss contribution of easily classified samples. The classification tendency of the target model can thus be flexibly adjusted, and the classification result of the target model obtained based on the first module is more accurate and better suited to the application scenario.
In some embodiments, referring to fig. 2, fig. 2 is a second implementation flow chart of a form data classification method according to an embodiment of the present application, where the form data classification method provided by the embodiment of the present application further includes, after step S101:
step S201, preprocessing original characteristic information of the target information, and determining second characteristic information; wherein each piece of original feature information comprises at least one feature field, and the preprocessing comprises at least one of removing unnecessary feature fields, combining similar feature fields and supplementing missing feature fields by an algorithm.
In an embodiment of the present application, the target information includes original feature information, and each piece of original feature information includes at least one feature field. For example, the target information is an after-sales service order, and the order includes original feature information such as the user name, user age, user gender, order time, and related products. Preprocessing the original feature information includes at least one of removing unnecessary feature fields, merging similar feature fields, and algorithmically supplementing missing feature fields. That is, the preprocessing includes, but is not limited to: de-duplication; removing feature fields that take only one value; cleaning the order-type and service-type fields and merging similar categories; and processing missing values. For example, if some content in the order has only one candidate value, preprocessing removes that unnecessary feature field.
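The preprocessing steps just listed can be sketched as follows; the field names are hypothetical, and this is an illustration rather than the patent's code. It de-duplicates records, fills missing values with each field's most frequent value, and drops fields that carry only one value.

```python
from collections import Counter

def preprocess(orders):
    """De-duplicate, fill missing values with the per-field mode,
    and drop constant (single-valued) fields.  Field names are hypothetical."""
    # 1. De-duplicate identical records.
    seen, unique = set(), []
    for o in orders:
        key = tuple(sorted(o.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(o))
    fields = {f for o in unique for f in o}
    # 2. Fill missing values with the most frequent value of the field.
    for f in fields:
        values = [o[f] for o in unique if o.get(f) is not None]
        if not values:
            continue
        mode = Counter(values).most_common(1)[0][0]
        for o in unique:
            if o.get(f) is None:
                o[f] = mode
    # 3. Drop fields that take only one value (they carry no information).
    constant = {f for f in fields if len({o[f] for o in unique}) == 1}
    return [{f: o[f] for f in o if f not in constant} for o in unique]

orders = [
    {"service_type": "repair",   "region": "north", "channel": "web"},
    {"service_type": "repair",   "region": "north", "channel": "web"},  # duplicate
    {"service_type": "return",   "region": "south", "channel": "web"},
    {"service_type": "exchange", "region": None,    "channel": "web"},
]
result = preprocess(orders)
print(result)   # 3 records; "channel" dropped as constant; region filled
```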
The second characteristic information in the embodiment of the application is used for determining the first characteristic information. By preprocessing the target information, unnecessary feature fields (such as information irrelevant to the first characteristic information) can be removed, repeated or similar feature fields merged, and missing feature fields supplemented by an algorithm, so that the target information is screened and completed to obtain the second characteristic information. In this way, the first characteristic information can be extracted from the target information, and the target information can be classified based on the first characteristic information, more accurately and efficiently.
Based on the foregoing embodiment, the form data classification method provided by the embodiment of the present application further includes, after step S201:
step S211, determining a correlation matrix according to the second characteristic information;
step S212, determining the correlation among the characteristic fields according to the correlation matrix;
step S213, removing the characteristic field with small correlation from the second characteristic information, and determining the first characteristic information.
Here, the correlation matrix may be drawn according to the second feature information to view the correlations between the variables, and filter features that have no or less influence on the order result. And the secondary verification can be performed through the histogram, so that the rationality of feature filtering is ensured.
If there are P correlated variables, calculating the correlation coefficient between every pair of variables yields P(P − 1)/2 correlation coefficients. Arranged in a numerical matrix according to the order of the variables, this matrix is called the correlation matrix.
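A minimal sketch of this correlation-based filtering (feature names and the 0.3 cutoff are hypothetical choices for illustration, not values from the patent): compute pairwise Pearson coefficients, note that P features give P(P − 1)/2 distinct pairs, and drop features weakly correlated with the target.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

features = {  # hypothetical numeric feature columns
    "order_age_days": [1, 2, 3, 4, 5, 6],
    "num_products":   [2, 4, 6, 8, 10, 12],   # perfectly correlated with above
    "noise":          [5, 1, 4, 2, 6, 3],
}
target = [0, 0, 0, 1, 1, 1]

names = list(features)
P = len(names)
corr = {(a, b): pearson(features[a], features[b]) for a in names for b in names}
n_pairs = P * (P - 1) // 2   # number of distinct off-diagonal coefficients
# Keep only features whose correlation with the target clears a threshold.
keep = [f for f in names if abs(pearson(features[f], target)) >= 0.3]
print(n_pairs, keep)
```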
In some embodiments, a chi-square test may further be performed on the feature information after the weakly correlated feature fields are removed; this step finds features with significant differences between groups and further screens the associated feature values. The chi-square test measures the deviation between the actual observed values of the samples and the theoretically inferred values; this deviation determines the chi-square value. The larger the chi-square value, the greater the deviation between observation and theory; conversely, the smaller the value, the smaller the deviation. If the two are exactly equal, the chi-square value is 0, indicating that the observed values fully agree with the theoretical values. The feature information retained after the chi-square test is taken as the first characteristic information.
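The chi-square statistic described here can be computed directly from a contingency table; the sketch below (an illustration, not the patent's code) sums (O − E)²/E over the cells, with expected counts E taken from the row and column marginals.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows:
    sum over all cells of (observed - expected)**2 / expected, where the
    expected count is row_total * column_total / grand_total."""
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    total = sum(rows)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = rows[i] * cols[j] / total
            stat += (o - e) ** 2 / e
    return stat

# A feature strongly associated with the label gives a large statistic;
# identical group distributions give exactly 0.
print(chi_square([[30, 10], [10, 30]]))   # 20.0
print(chi_square([[20, 20], [20, 20]]))   # 0.0
```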
In some embodiments, after the weakly correlated feature fields are removed, feature selection may be performed according to the importance of the remaining features, and the selected feature information is taken as the first characteristic information to be input into the target model for classification. By preprocessing the original features of the target information to obtain the second characteristic information, and then removing the weakly correlated feature fields from it to determine the first characteristic information, the determined first characteristic information more accurately represents the features relevant to the target model's classification, making model prediction more efficient and accurate.
Based on the foregoing embodiments, the embodiment of the present application further provides a form data classification method in which a decision tree algorithm may be introduced into after-sales service order processing, improving the after-sales service order workflow through machine learning; predicting the rejection rate of orders, reducing the manual portion of the business; and shortening the after-sales order fulfillment time, thereby improving the customer experience.
Compared with the related art, the scheme in this embodiment of the present application can determine, through machine learning, the main influencing factors in the service order identification process, thereby reducing subjective human influence and optimizing the service flow. Moreover, in combination with the first module, when the positive and negative samples of the after-sales service order data set are obviously unbalanced, the weight factors of the focal loss function are used to better suppress problems such as the imbalance in the numbers of positive and negative samples and the imbalance between easily distinguished and hard-to-distinguish samples, so that the false alarm rate and the missed detection rate are effectively controlled and automatic identification of service orders is realized.
Fig. 3A is a third schematic flow chart of an implementation of a form data classification method according to an embodiment of the present application. As shown in fig. 3A, the method includes the following steps:
step S301, defining a target variable;
step S302, acquiring an after-sales service order in a period of time, and performing exploratory analysis;
step S303, data preprocessing;
step S304, feature engineering;
step S305, introducing a focal loss function into a CatBoost model;
step S306, analyzing the result.
Next, the form data classification method in this embodiment of the present application will be described in detail with reference to fig. 3A.
Defining the target variable may include: based on the historical after-sales service orders, defining orders whose status is updated to "completed" in the system within 30 days as "good orders", and orders whose status is updated to "cancelled" within 30 days as "bad orders". Data acquisition and exploratory analysis may include: selecting order data for modeling, and acquiring after-sales service orders and the data fields related to the service orders from a customer relationship management system; checking the numbers of positive and negative samples ("good orders" and "bad orders") in the service order data set; understanding the general condition of the data set, including the missing values, means, and medians of the fields; and checking the distributions of the numerical variables and the categorical variables in the data set to obtain multidimensional categorical features.
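The labeling and exploratory steps above can be sketched in pandas. The field names ("status", "days_to_update") and the sample rows are illustrative assumptions, since the source does not specify the CRM export schema:

```python
import pandas as pd

# Hypothetical order export; the real CRM field names are not given in the source.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "status": ["completed", "cancelled", "completed", "cancelled"],
    "days_to_update": [12, 25, 8, 31],
})

# "Good order": status became "completed" within 30 days;
# "bad order": status became "cancelled" within 30 days.
within_30 = orders["days_to_update"] <= 30
orders["label"] = "other"
orders.loc[within_30 & (orders["status"] == "completed"), "label"] = "good"
orders.loc[within_30 & (orders["status"] == "cancelled"), "label"] = "bad"

# Exploratory analysis: class balance, missing values, basic statistics.
print(orders["label"].value_counts())
print(orders.isna().sum())
print(orders.describe())
```

In practice the class counts from `value_counts()` are what reveal the positive/negative imbalance that later motivates the focal loss.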
Data preprocessing includes, but is not limited to: deduplication; removing feature fields that take only one value; cleaning the fields related to bill types and service types and merging similar types; and processing missing values.
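A minimal pandas sketch of these preprocessing steps; the column names and values are invented for illustration:

```python
import pandas as pd

# Toy order table: one duplicate row, one single-valued column, two missing values.
df = pd.DataFrame({
    "order_type": ["repair", "repair", "exchange", None],
    "region": ["north", "north", "north", "north"],  # takes only one value
    "amount": [100.0, 100.0, None, 80.0],
})

df = df.drop_duplicates()  # deduplication

# Remove feature fields that take only one value.
nunique = df.nunique(dropna=True)
df = df.drop(columns=nunique[nunique <= 1].index)

# Process missing values: median for numeric, mode for categorical.
df["amount"] = df["amount"].fillna(df["amount"].median())
df["order_type"] = df["order_type"].fillna(df["order_type"].mode()[0])
print(df)
```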
Feature selection may include: drawing a correlation matrix, checking the pairwise correlations among the variables, and filtering out the features that have no influence or little influence on the order result. A secondary verification may be performed with histograms to ensure that the feature filtering is reasonable.
Here the correlation matrix is as follows: for p related variables, the correlation coefficient between each pair of variables is calculated, and the resulting coefficients are arranged in a matrix according to the index order of the variables; this matrix is called the correlation matrix and is commonly denoted by the letter R.
On the diagonal from the upper left to the lower right are the correlations of each variable with itself, whose values are all 1, and the correlation coefficients above and below the diagonal are symmetric. Analyzing the correlation matrix is an important first step in solving problems that involve multiple independent variables.
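The construction of such a matrix can be sketched as follows; the three synthetic variables stand in for the p order features:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({
    "var1": x,
    "var2": x + rng.normal(scale=0.1, size=100),  # strongly correlated with var1
    "var3": rng.normal(size=100),                 # independent noise
})

R = df.corr()  # p x p symmetric matrix with ones on the diagonal
print(R.round(3))
```

Features whose row in R shows near-zero correlation with the target column are candidates for filtering, as described above.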
Illustratively, fig. 3B is a schematic diagram of a correlation matrix used to characterize the correlations between variables according to an embodiment of the present application. As shown in fig. 3B, the variables include variable 1 to variable 14. It can be seen that the correlation between variable 1 and itself is 1, the correlation between variable 1 and variable 2 is -0.098, and so on; the correlation between any two variables can be read from the correlation matrix shown in fig. 3B and will not be enumerated here. Fig. 3C is a schematic diagram of the correlations between bad orders and the feature variables according to an embodiment of the present application. Each column on the abscissa of fig. 3C represents a feature variable, and the ordinate is the correlation between the corresponding feature variable and bad orders; from left to right, these correlations decrease in turn. In practical implementation, the selection of feature variables, that is, feature selection, may be performed by combining fig. 3B and fig. 3C.
Then, a chi-square test is performed on the order data obtained after filtering. The purpose of this step is to find, in the order data, the features that differ significantly among groups, and to further screen the associated feature values.
The chi-square test measures the degree of deviation between the actual observed values and the theoretically inferred values of the samples; this degree of deviation determines the chi-square value. The larger the chi-square value, the greater the deviation between the observed and theoretical values; conversely, the smaller the chi-square value, the smaller the deviation. If the two are completely equal, the chi-square value is 0, indicating that the observed values fully match the theoretical values.
Fig. 3D is a schematic diagram of chi-square test results according to an embodiment of the present application. For the first result, "Cancelled by INW", the significance value in the chi-square test result is 0.00, which is smaller than 0.05, indicating that there is a significant difference among the INW groups, so a cross analysis may be performed. For the order data "Cancelled by DeliverInTime", the significance value is 0.16, which is larger than 0.05, indicating that there is no significant difference among the DeliverInTime groups, so a cross analysis need not be performed; this order data can thus be further analyzed to screen the feature values.
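A chi-square test of this kind can be sketched with SciPy. The contingency table counts are invented for illustration, and the 0.05 significance threshold follows the text:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: cancelled vs. not cancelled,
# split by a feature group (e.g. in-warranty vs. out-of-warranty).
table = np.array([[90, 10],
                  [60, 40]])

chi2, p, dof, expected = chi2_contingency(table)
# p < 0.05  -> significant difference among groups: keep the feature
# p >= 0.05 -> no significant difference: the feature may be screened out
print(f"chi2={chi2:.2f}, p={p:.4f}, keep={'yes' if p < 0.05 else 'no'}")
```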
Next, feature selection is completed using the feature importance. Illustratively, fig. 3E is a schematic diagram of a feature importance map according to an embodiment of the present application. As shown in fig. 3E, the horizontal axis of the importance map is the degree of feature importance and the vertical axis lists the features, which include feature 1 to feature 13. As can be seen from the figure, feature 1 has the greatest feature importance, and the importance of features 1 to 13 decreases in turn. With the feature importance map, the required features can be screened out intuitively and conveniently.
In an embodiment, the features whose feature importance is greater than or equal to an importance threshold may be selected; alternatively, the features may be ranked by feature importance (e.g., from high to low as in fig. 3E) and the top N features selected, where N may be a positive integer greater than or equal to 2. In addition, N may be determined so that a target condition is satisfied: with the total number of features denoted as M, the target condition may be, for example, that N/M is greater than or equal to a target ratio. In an actual scenario, a person skilled in the art may select the features according to actual needs, which is not specifically limited by this embodiment of the present application.
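Both selection rules can be sketched as follows; the importance values are illustrative numbers, not outputs of the actual model:

```python
# Feature importances as produced by a trained model (illustrative values).
importances = {"feature1": 0.31, "feature2": 0.22, "feature3": 0.18,
               "feature4": 0.05, "feature5": 0.01}

# Rule 1: keep features whose importance meets a threshold.
threshold = 0.10
by_threshold = [f for f, imp in importances.items() if imp >= threshold]

# Rule 2: keep the top-N features; N could also be chosen so that
# N / M >= some target ratio, with M the total number of features.
N = 3
ranked = sorted(importances, key=importances.get, reverse=True)
top_n = ranked[:N]
print(by_threshold, top_n)
```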
In practical implementation, data modeling is performed on the obtained data. In an embodiment, the obtained data is split, and the ratio of order data in the training set to that in the test set may be 3:1. The split data can then be used for training and prediction with a CatBoost algorithm model. Here, the training process of the CatBoost model may refer to the training process of the classification model (target model) in the foregoing embodiments and is not repeated here.
After the model is obtained, the model is evaluated. In one embodiment, the algorithm model may be evaluated by the F1 score and the AUC value obtained with the final target model.
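The F1 score and AUC can be computed with scikit-learn; the labels and probabilities below are illustrative, not the model's actual outputs:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Illustrative ground truth and predicted cancellation probabilities.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at a 0.5 threshold

print("F1 :", f1_score(y_true, y_pred))   # uses hard labels
print("AUC:", roc_auc_score(y_true, y_prob))  # uses probabilities
```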
Illustratively, fig. 3F is a schematic diagram of a confusion matrix after classification by a model trained with the focal loss function according to an embodiment of the present application. As shown in fig. 3F, the number of sample orders that are cancelled in the actual label but not cancelled in the predicted label is 137; it should be understood that the predicted label here is the classification result of the model trained with the focal loss function introduced, that is, the model corresponding to fig. 3F produces 137 misclassifications of samples that should be "cancelled". The number of sample orders that are not cancelled in the actual label but cancelled in the predicted label is 608, that is, the model produces 608 misclassifications of samples that should be "not cancelled". The model trained with the focal loss function introduced can control the ratio between these two kinds of errors, so that it can be adjusted according to actual needs. In this embodiment of the application, since an order that will actually be cancelled requires subsequent manual intervention and the labor cost is high, the model is expected to reduce the false alarm rate (i.e., misclassification) for orders to be cancelled. By introducing the focal loss function, the first parameter and the second parameter are adjusted to tune the classification tendency, and the classification result is controlled to lean either toward user experience (accepting orders that may be cancelled) or toward cost (not accepting orders that may be cancelled), so that the classification result can be adjusted flexibly according to actual requirements. In an actual scenario, a person skilled in the art can tune the parameters accordingly.
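The source does not give the exact form of its focal loss, so the sketch below uses the standard binary focal loss of Lin et al., with alpha playing the role of the first parameter (the positive/negative loss proportion) and gamma the role of the second parameter (down-weighting easy samples so hard ones contribute more):

```python
import numpy as np

def focal_loss(y_true, p, alpha=0.25, gamma=2.0):
    """Binary focal loss per sample.

    alpha: weighting between positive and negative samples
           (role of the "first parameter" in the text).
    gamma: exponent that shrinks the loss of easy samples
           (role of the "second parameter" in the text).
    """
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(y_true == 1, p, 1 - p)         # probability of the true class
    w = np.where(y_true == 1, alpha, 1 - alpha)  # class weighting
    return -w * (1 - pt) ** gamma * np.log(pt)

y = np.array([1, 1, 0, 0])
p = np.array([0.9, 0.6, 0.1, 0.4])  # an easy and a hard sample per class
print(focal_loss(y, p).round(4))
```

Raising alpha shifts loss toward the positive (cancelled) class, and raising gamma concentrates loss on hard samples; with alpha = 0.5 and gamma = 0 the expression reduces to half the ordinary cross-entropy.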
Based on the foregoing embodiments, an embodiment of the present application provides a form data classification apparatus. The apparatus, the modules it includes, the units included in each module, and the components included in each unit may be implemented by a processor in an electronic device; of course, they may also be implemented by specific logic circuits. In implementation, the processor may be a CPU (Central Processing Unit), an MPU (Microprocessor Unit), a DSP (Digital Signal Processor), an FPGA (Field Programmable Gate Array), or the like.
Fig. 4 is a schematic diagram of the composition structure of a form data classification apparatus according to an embodiment of the present application. As shown in fig. 4, the apparatus 400 includes:
an acquisition unit 401 for acquiring form data from the target information;
a determining unit 402, configured to determine, based on the first feature information of the form data, a classification result of the target information through a target model;
the first parameter and the second parameter are adjusted, and the classification tendency of the target model is changed so as to adjust the classification result; the first parameter and the second parameter are related to sample loss.
In some embodiments, the apparatus further comprises:
the training unit is used for determining a second classification model after the t-th iterative training according to the first classification model after the t-1 th iterative training and combining the first module of the t-th iterative training; the target model is a classification model trained in the last iteration, and t is a positive integer greater than or equal to 2.
In some embodiments, the apparatus further comprises:
the module determining unit is used for determining a second module according to the predicted probability value of the positive sample;
the module determining unit is further configured to determine a third module according to the linear coefficient relationship of the second module;
the module determining unit is further configured to determine the first module according to a linear coefficient relationship of the third module;
wherein the third module includes the first parameter and the second parameter.
In some embodiments, the first and second parameters are related to sample loss, comprising:
the first parameter is a loss proportion parameter between positive and negative samples;
the second parameter is a loss contribution parameter in the frangible sample.
In some embodiments, the apparatus further comprises:
the preprocessing unit is used for preprocessing the original characteristic information of the target information and determining second characteristic information;
wherein each piece of original feature information comprises at least one feature field, and the preprocessing comprises at least one of removing unnecessary feature fields, combining similar feature fields and supplementing missing feature fields by an algorithm.
In some embodiments, the apparatus further comprises:
the feature determining unit is used for determining a correlation matrix according to the second feature information;
the feature determining unit is further configured to determine a correlation between the feature fields according to the correlation matrix;
the feature determining unit is further configured to remove the feature field with small correlation from the second feature information, and determine the first feature information.
The description of the apparatus embodiments above is similar to that of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus of the present application, please refer to the description of the embodiments of the method of the present application.
In the embodiments of the present application, if the form data classification method is implemented in the form of a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present application may essentially, or in the part contributing to the related art, be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing an electronic device (which may be a personal computer, a server, etc.) to perform all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM (Read-Only Memory), a magnetic disk, or an optical disk. Thus, the embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, an embodiment of the present application provides an electronic device, including a memory and a processor. The memory stores a computer program that can run on the processor, and the processor, when executing the program, implements the steps in the form data classification method provided in the foregoing embodiments.
Correspondingly, an embodiment of the present application provides a readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps in the form data classification method described above are implemented.
It should be noted here that: the description of the storage medium and apparatus embodiments above is similar to that of the method embodiments described above, with similar benefits as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and the apparatus of the present application, please refer to the description of the method embodiments of the present application.
It should be noted that fig. 5 is a schematic diagram of a hardware entity of an electronic device according to an embodiment of the present application. As shown in fig. 5, the hardware entity of the electronic device 500 includes: a processor 501, a communication interface 502, and a memory 503, wherein:
The processor 501 generally controls the overall operation of the electronic device 500.
The communication interface 502 may enable the electronic device 500 to communicate with other electronic devices or servers or platforms over a network.
The memory 503 is configured to store instructions and applications executable by the processor 501, and may also cache data (for example, image data, audio data, voice communication data, and video communication data) to be processed or already processed by the modules in the processor 501 and the electronic device 500; it may be implemented by a FLASH memory or a RAM (Random Access Memory).
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division into units is only a logical functional division, and there may be other divisions in practice, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may all be integrated in one processing module, or each unit may serve as a separate unit, or two or more units may be integrated in one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units. Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disk.
The methods disclosed in the method embodiments provided by the application can be arbitrarily combined under the condition of no conflict to obtain a new method embodiment.
The features disclosed in the several product embodiments provided by the application can be combined arbitrarily under the condition of no conflict to obtain new product embodiments.
The features disclosed in the embodiments of the method or the apparatus provided by the application can be arbitrarily combined without conflict to obtain new embodiments of the method or the apparatus.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A form data classification method, the method comprising:
acquiring form data from target information;
determining a classification result of the target information through a target model based on first characteristic information of the form data; and
adjusting a first parameter and a second parameter to change a classification tendency of the target model so as to adjust the classification result, wherein the first parameter and the second parameter are related to sample loss.
2. The method of claim 1, the method further comprising:
according to the first classification model after the t-1 th iteration training, combining the first module of the t iteration training, and determining the second classification model after the t iteration training;
the target model is a classification model trained in the last iteration, and t is a positive integer greater than or equal to 2.
3. The method of claim 2, the method further comprising:
determining a second module according to the predicted probability value of the positive sample;
determining a third module according to the linear coefficient relation of the second module;
determining the first module according to the linear coefficient relation of the third module;
wherein the third module includes the first parameter and the second parameter.
4. The method of claim 1, the first and second parameters relating to sample loss, comprising:
the first parameter is a loss proportion parameter between positive and negative samples;
the second parameter is a loss contribution parameter in the frangible sample.
5. The method of claim 1, the method further comprising:
preprocessing the original characteristic information of the target information to determine second characteristic information;
wherein each piece of original feature information comprises at least one feature field, and the preprocessing comprises at least one of removing unnecessary feature fields, combining similar feature fields and supplementing missing feature fields by an algorithm.
6. The method of claim 5, the method further comprising:
determining a correlation matrix according to the second characteristic information;
according to the correlation matrix, determining the correlation between the characteristic fields;
and removing the characteristic field with small correlation from the second characteristic information, and determining the first characteristic information.
7. A form data classification apparatus, the apparatus comprising:
an acquisition unit, configured to acquire form data from target information; and
a determining unit, configured to determine a classification result of the target information through a target model based on first characteristic information of the form data;
wherein a first parameter and a second parameter are adjusted to change a classification tendency of the target model so as to adjust the classification result, the first parameter and the second parameter being related to sample loss.
8. An electronic device, comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the program, implements the steps in the form data classification method of any one of claims 1 to 6.
9. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in the form data classification method of any one of claims 1 to 6.
CN202310720362.7A 2023-06-16 2023-06-16 Form data classification method and device, equipment and storage medium Pending CN116821818A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310720362.7A CN116821818A (en) 2023-06-16 2023-06-16 Form data classification method and device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116821818A true CN116821818A (en) 2023-09-29

Family

ID=88128531


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination