CN114548565A

CN114548565A - Express prediction method based on random forest

Info

Publication number: CN114548565A
Application number: CN202210173732.5A
Authority: CN
Inventors: 李武; 张仲; 王晓飞; 狄筝
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2022-02-24
Filing date: 2022-02-24
Publication date: 2022-05-27

Abstract

The invention discloses an express prediction method based on a random forest, comprising the following steps: collecting historical data that influence the number of express orders to form a sample set; screening feature data from the sample set to form a feature set; constructing a daily express delivery prediction function using a regression prediction method; and solving for the daily express delivery volume using a random forest method based on the feature set and the daily express delivery prediction function. The method is mainly applied to an express service platform. By training a model on the feature data, it predicts the number of express orders quickly and at low cost, and the random forest algorithm yields a fast and accurate prediction of the average daily express volume. The platform thereby gains a better understanding of user characteristics and needs, so that couriers and express vehicles can be dispatched flexibly, users' delivery completion time is reduced, and the platform's labor and time costs are saved.

Description

Express prediction method based on random forest

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to an express prediction method based on a random forest.

Background

With the development of social modernization and informatization, the volume of data generated by information management within express platforms keeps growing. As technologies such as artificial intelligence and big data analysis mature, large organizations increasingly use intelligent products for business management and use mass data to improve the efficiency of business handling. However, in the prior art, and especially inside express platforms, business handling still suffers from many mechanical and blind practices. How to use intelligent products within the platform to reduce manpower, material, and financial costs is worth in-depth research. In addition, a decision maker facing mass data needs to find effective information quickly, identify patterns in the data, and grasp the important information and data details that affect key business characteristics, so as to avoid inefficient or even erroneous analysis and judgment.

Disclosure of Invention

The invention provides an express prediction method based on a random forest, aimed at the waste in allocating express resources such as manpower, material, and funds in existing large organizations; the method predicts the number of express orders quickly and at low cost. To solve the above technical problem, the invention adopts the following technical solution:

An express prediction method based on a random forest comprises the following steps:

S1, collecting historical data that influence the number of express orders to form a sample set;

S2, screening feature data from the sample set to form a feature set;

S3, constructing a daily express delivery prediction function using a regression prediction method;

S4, solving for the daily express delivery volume using a random forest method, based on the feature set established in step S2 and the daily express delivery prediction function established in step S3.

The step S1 includes the following steps:

S1.1, collecting an express delivery data set covering a number of days for a given unit;

S1.2, cleaning the sample set collected in step S1.1 using mean value substitution;

S1.3, summing the express delivery counts of each time period of each day in the cleaned sample set to obtain the unit's daily delivery count.

The features in the feature set include order status, sender department, and sender unit.

In step S3, the daily express delivery prediction function is defined by an expression (the formula itself is not reproduced in this text) in which α_n is the coefficient associated with feature x_n of the feature set, ε is a random error, and the function value is the final predicted daily express delivery volume.

The step S4 includes the following steps:

S4.1, using the bootstrap method, randomly drawing m samples from the M samples in the sample set as a sub-training set for constructing a decision tree, where M > m;

S4.2, simultaneously constructing T-1 decision trees using the method of step S4.1;

S4.3, randomly selecting p features from the n features of the feature set as the node splitting subset, selecting by squared error the one feature among the p with the smallest error as the node splitting feature, and continuing to split nodes until the decision tree cannot be split further, where n > p;

S4.4, splitting the T decision trees according to the method of step S4.3 to form a random forest;

S4.5, training each split decision tree on its randomly drawn samples, based on the daily express delivery prediction function, to obtain the daily express delivery value corresponding to each decision tree;

S4.6, averaging the daily express delivery values of all the decision trees to obtain the predicted daily express delivery volume.

The invention has the beneficial effects that:

The method is mainly applied to an express service platform. By training a model on the feature data, it predicts the number of express orders quickly and at low cost; the random forest algorithm yields a fast and accurate prediction of the average daily express volume, and estimating the number of orders in advance allows resources to be prepared for the service ahead of time. Because the order count is predicted well, the platform gains a better understanding of user characteristics and needs, so that couriers and express vehicles can be dispatched flexibly and users' delivery completion time is shortened, saving the platform's labor and time costs.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a prior art random forest.

FIG. 2 is a graph of the predicted results of the present invention.

FIG. 3 is a graph of the predicted results of a model based on a logistic regression algorithm.

FIG. 4 is a graph of the predicted results of a model based on the least absolute shrinkage and selection operator (LASSO) algorithm.

FIG. 5 is a schematic flow chart of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.

An express prediction method based on a random forest, as shown in FIG. 5, comprises the following steps:

The express delivery data set comprises the sender unit, the sender department, the number of express deliveries in each time period of each day, and the order status of each express item. The sender unit is the name of the unit to which the sender belongs, and the sender department is the department in which the sender works; the order status indicates whether the item has been dispatched, is in transit, or the order is completed.

The cleaning comprises handling missing values, removing content inconsistent with the original data type, removing unneeded data, and removing logically erroneous data. Logically erroneous data are problem data that can be detected through simple logical reasoning.

S1.3, summing the express delivery counts of each time period of each day in the cleaned sample set to obtain the unit's daily delivery count.
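The cleaning and aggregation in steps S1.2 and S1.3 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record layout (`date`, `period`, `count`), the use of `None` to mark a missing value, and the sample numbers are all assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean

def clean_by_mean(records):
    """Replace missing per-period counts (None) with the mean of the observed counts."""
    observed = [r["count"] for r in records if r["count"] is not None]
    fill = mean(observed)
    return [{**r, "count": fill if r["count"] is None else r["count"]} for r in records]

def daily_totals(records):
    """Sum the per-period counts of each day to get the unit's daily delivery count."""
    totals = defaultdict(float)
    for r in records:
        totals[r["date"]] += r["count"]
    return dict(totals)

# Hypothetical records; field names and values are illustrative only.
records = [
    {"date": "2021-05-01", "period": "morning", "count": 4},
    {"date": "2021-05-01", "period": "afternoon", "count": None},  # missing value
    {"date": "2021-05-02", "period": "morning", "count": 6},
    {"date": "2021-05-02", "period": "afternoon", "count": 8},
]
cleaned = clean_by_mean(records)
totals = daily_totals(cleaned)
```

Here the missing afternoon count is filled with the mean of the three observed counts before the per-day sums are taken.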

In this embodiment, Excel is used to organize the historical data. The raw data comprise 7271 records with 10 fields, cover two units (express deliveries within a unit and express deliveries between the two units), and span May 2021 to October 2021.

S2, screening feature data from the sample set to form a feature set, where the feature set is denoted feature_i and is expressed as:

feature_i = (x₁, x₂, …, x_n);

where x_n denotes one feature in the feature set and n denotes the number of features in the feature set. In this embodiment, n = 3, x₁ is the order status, x₂ is the sender department, and x₃ is the sender unit.
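Because the three features are categorical, they need a numeric form before a tree can split on them. The text does not say how the features are encoded, so the sketch below assumes a simple integer (ordinal) encoding; the field names and category values are hypothetical.

```python
def build_encoder(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

# Hypothetical orders; the category names are made up for illustration.
orders = [
    {"status": "in transit", "department": "Finance", "unit": "Unit A"},
    {"status": "completed", "department": "Library", "unit": "Unit B"},
    {"status": "completed", "department": "Finance", "unit": "Unit A"},
]
FIELDS = ("status", "department", "unit")  # x1, x2, x3 in the feature set
encoders = {f: build_encoder(o[f] for o in orders) for f in FIELDS}

def to_feature_vector(order):
    """Return (x1, x2, x3) for one order using the integer codes."""
    return tuple(encoders[f][order[f]] for f in FIELDS)

vectors = [to_feature_vector(o) for o in orders]
```

Other encodings (for example one-hot) would work as well; the ordinal choice here is only for brevity.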

The daily express delivery prediction function is defined by an expression (the formula itself is not reproduced in this text) in which α_n is the coefficient associated with feature x_n of the feature set, ε is a random error, and the function value is the final predicted daily express delivery volume.
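The expression itself is not available here. As an assumption, the sketch below takes the simplest reading consistent with the symbols named in the text: a linear combination of the features x_n weighted by α_n, plus the random error ε. The function name and all numeric values are hypothetical.

```python
def predict_daily_volume(alpha, x, epsilon=0.0):
    """Assumed linear form: sum of alpha_k * x_k over the features, plus a random error term."""
    if len(alpha) != len(x):
        raise ValueError("alpha and x must have the same length")
    return sum(a * xk for a, xk in zip(alpha, x)) + epsilon

alpha = (2.0, 0.5, 1.5)  # hypothetical coefficients for x1, x2, x3
x = (1, 4, 2)            # hypothetical encoded feature values
y_hat = predict_daily_volume(alpha, x)
```

If the actual expression contains further terms, such as an intercept, the function would need to be extended accordingly.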

S4, as shown in FIG. 1, solving for the daily express delivery volume using a random forest method, based on the feature set established in step S2 and the daily express delivery prediction function established in step S3, comprising the following steps:

S4.3, randomly selecting p features from the n features of the feature set as the node splitting subset, selecting by squared error the one feature among the p with the smallest error as the node splitting feature, and continuing to split nodes until the decision tree cannot be split further;

In this embodiment, the 3 features form the root node and the internal nodes, the daily delivery count is the output, i.e., the leaf nodes, and n > p.

S4.6, taking the mean of the daily express delivery values of all the decision trees to obtain the predicted daily express delivery volume.
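The node splitting rule of step S4.3 can be sketched as follows. It is a minimal illustration under stated assumptions: features are already numerically encoded, a split has the form "feature value ≤ threshold", and the error of a split is the sum of squared errors of the two child nodes. The helper names and the toy data are hypothetical.

```python
import random

def sse(values):
    """Sum of squared errors of the values around their mean."""
    if not values:
        return 0.0
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values)

def best_split(X, y, p, rng):
    """Pick p of the n features at random; return (error, feature, threshold) of the lowest-error split."""
    n = len(X[0])
    candidates = rng.sample(range(n), p)
    best = None
    for j in candidates:
        # Every distinct value except the largest is a candidate threshold.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, j, t)
    return best  # None if no candidate feature can be split

# Toy data: n = 3 encoded features per sample, daily delivery counts as targets.
X = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
y = [10, 12, 30, 32]
err, feature, threshold = best_split(X, y, p=2, rng=random.Random(0))
```

A full tree would apply `best_split` recursively to the left and right subsets until no further split is possible.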

The random forest comprises a plurality of decision trees, and the set of decision trees is constructed using the Bootstrap idea, i.e., each training set is formed by sampling with replacement. The random forest is insensitive to noise in the training set and trains quickly; because its trees can be trained in parallel, training is accelerated, achieving fast training and prediction. Since the random forest is based on multiple decision trees, it is more robust than a single decision tree algorithm.
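The bootstrap draw and the final averaging can be sketched as follows. To keep the example short, the base learner is a placeholder that always predicts the mean target of its sub-training set; it stands in for a fully split decision tree and is not the patent's tree. All names and values are hypothetical.

```python
import random

def bagged_predict(samples, m, T, fit, x, rng):
    """Train T learners, each on m samples drawn with replacement, and average their predictions."""
    predictions = []
    for _ in range(T):
        sub_training_set = [rng.choice(samples) for _ in range(m)]  # bootstrap draw
        model = fit(sub_training_set)
        predictions.append(model(x))
    return sum(predictions) / T

def fit_mean(sub_training_set):
    """Placeholder learner: always predicts the mean target of its sub-training set."""
    mu = sum(target for _, target in sub_training_set) / len(sub_training_set)
    return lambda x: mu

# Toy sample set of M = 4 (features, daily count) pairs; m = 3 satisfies M > m.
samples = [((0, 0, 1), 10), ((0, 1, 1), 12), ((1, 0, 0), 30), ((1, 1, 0), 32)]
estimate = bagged_predict(samples, m=3, T=50, fit=fit_mean, x=(1, 0, 0),
                          rng=random.Random(0))
```

Because each learner is fitted independently of the others, the loop body could be run in parallel, which is the source of the training speed-up described above.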

The loss functions are the Mean Squared Error (MSE) and the Mean Absolute Error (MAE). The loss function measures the degree of difference between the algorithm's predictions and the actual data, and thus the quality of the model.

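The mean squared error is calculated as follows:

$$loss_{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(z_i - \hat{y}_i\right)^2$$

where $loss_{MSE}$ denotes the mean squared error loss function, $m$ denotes the total number of samples, $z_i$ denotes the actual daily express delivery count corresponding to the i-th sample, and $\hat{y}_i$ denotes the predicted daily express delivery volume for the i-th sample;

the mean absolute error is calculated as follows:

$$loss_{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|z_i - \hat{y}_i\right|$$

where $loss_{MAE}$ denotes the mean absolute error loss function.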
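Both loss functions follow their standard definitions and can be computed directly from the actual counts z_i and the corresponding model predictions. The sketch below uses made-up numbers purely to show the arithmetic.

```python
def mse(actual, predicted):
    """Mean squared error: average of squared differences between actual and predicted values."""
    m = len(actual)
    return sum((z - p) ** 2 for z, p in zip(actual, predicted)) / m

def mae(actual, predicted):
    """Mean absolute error: average of absolute differences between actual and predicted values."""
    m = len(actual)
    return sum(abs(z - p) for z, p in zip(actual, predicted)) / m

actual = [10, 12, 30, 32]     # hypothetical actual daily counts z_i
predicted = [11, 12, 28, 35]  # hypothetical model predictions
mse_value = mse(actual, predicted)  # (1 + 0 + 4 + 9) / 4
mae_value = mae(actual, predicted)  # (1 + 0 + 2 + 3) / 4
```

MSE penalizes large errors more heavily than MAE, which is why reporting both gives a fuller picture of model quality.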

As shown in FIG. 2 to FIG. 4, the present application is compared with the Logistic Regression (LR) algorithm and the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm. As shown in Table 1 below, the experiments show that the present application performs best on both the MSE and MAE metrics.

TABLE 1 comparison of the results

In application, a sender requests express delivery through the express platform, a courier accepts the order, collects the item, and delivers it to the designated area, and the recipient signs for it to complete the order. Taking the Tianjin University comprehensive service platform developed in this manner as an example, over the period from 2019 to 2020, 874 orders had been accumulated by 2020. With the appointment waiting time of recipient and courier reduced by 5 minutes per order, about 72 hours were saved in total, greatly reducing the platform's labor and time costs.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. An express prediction method based on a random forest is characterized by comprising the following steps:

S2, screening feature data from the sample set to form a feature set;

2. The random forest based express prediction method of claim 1, wherein the step S1 comprises the following steps:

3. The random forest based express prediction method of claim 1, wherein the features in the feature set include order status, sender department, and sender unit.

4. The random forest based express prediction method of claim 1, wherein in step S3 the daily express delivery prediction function is defined by an expression (the formula itself is not reproduced in this text) in which α_n is the coefficient associated with feature x_n of the feature set, ε is a random error, and the function value is the final predicted daily express delivery volume.

5. The random forest based express prediction method of claim 1, wherein the step S4 comprises the following steps:

S4.1, using the bootstrap method, randomly drawing m samples from the M samples in the sample set as a sub-training set for constructing a decision tree, where M > m;

S4.2, simultaneously constructing T-1 decision trees using the method of step S4.1;

S4.3, randomly selecting p features from the n features of the feature set as the node splitting subset, selecting by squared error the one feature among the p with the smallest error as the node splitting feature, and continuing to split nodes until the decision tree cannot be split further, where n > p;