CN111898901A - LightGBM-based quantitative investment calculation method, storage medium and equipment - Google Patents


Info

Publication number
CN111898901A
Authority
CN
China
Prior art keywords
data
lightgbm
factor
calculation method
index
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010734157.2A
Other languages
Chinese (zh)
Inventor
吴炳鑫
吝勃
史维峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern University
Original Assignee
Northwestern University
Application filed by Northwestern University
Priority to CN202010734157.2A
Publication of CN111898901A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/06 Asset management; Financial planning or analysis

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Accounting & Taxation (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Technology Law (AREA)
  • Educational Administration (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The invention discloses a LightGBM-based quantitative investment calculation method, storage medium and equipment. Stock historical data are acquired and a factor pool is constructed, comprising financial index factors, technical index factors and other factors. The factor data undergo missing value processing and abnormal value processing, and the dimensions of different indexes are then unified through data standardization to complete data preprocessing. Parameters are tuned using the LightGBM algorithm. Finally, rate-of-return indexes and risk-measurement indexes are selected, and stocks with high return and low risk are chosen for investment. The invention frees users from traditional manual analysis and qualitative investment, and solves the problems of slow speed, low accuracy and lack of parallel computation in the prior art.

Description

LightGBM-based quantitative investment calculation method, storage medium and equipment
Technical Field
The invention belongs to the technical field of financial investment, and particularly relates to a LightGBM-based quantitative investment calculation method, a storage medium and equipment.
Background
For a long time, stock investment was dominated by qualitative investing. Qualitative investment analyzes a stock's fundamentals (the industry the company belongs to, its core competitiveness, management quality, profitability and so on), combines them with the current stock price, and decides whether to buy or sell according to the investor's personal feeling and experience. In recent years, traditional qualitative investment has struggled to satisfy investors in an increasingly complex financial environment, and the rapid development of computer technology combined with stock investment has produced an emerging concept: quantitative investment.
What is quantitative investment? Quantitative investment is the process of realizing an investment strategy with computer technology and a suitable mathematical model.
For example, a mathematical model selects the stock whose daily trading volume or price rises the most as the purchase target, and a computer automatically places the buy order the next day. Quantitative investment emerged in the early 1980s and has developed steadily in the years since. It greatly reduces the workload of manual analysis and avoids the poor trading results caused by limited personal investment experience and emotional fluctuation during trading. Quantitative investment therefore improves trading efficiency and investment returns while reducing human error.
With the development of society, data permeates all walks of life and has become an ever more important factor of production. Machine learning has achieved major breakthroughs in recent years, drawing on optimization theory, statistics and other fields. It studies how a computer simulates or realizes human learning behavior in order to acquire new knowledge and update existing model structures so as to continuously improve performance. Under the wave of big data, machine learning plays an increasingly important role: scholars keep proposing new learning algorithms that greatly improve a computer's ability to extract features and discover implicit laws from large amounts of data, machine learning methods are ever more widely applied in data mining and analysis, and quantitative investment is precisely the task of discovering implicit laws in large amounts of historical data.
In the prior art, the classical algorithms available for quantitative investment mainly include Bayes, SVM, and ensemble learning such as random forest and XGBoost. However, Bayes classifiers and SVMs are weak learners with limited accuracy; the SVM can only perform binary classification, and with large data volumes it does not support parallel computation, runs slowly, and struggles to predict the complexity of the stock market; random forest and XGBoost still leave room for improvement in running speed. Aiming at the low accuracy, low speed and binary-only classification of the prior art, the invention provides a quantitative investment method based on the LightGBM algorithm, effectively overcoming these defects.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a LightGBM-based quantitative investment calculation method, a storage medium and a device that address the above-mentioned deficiencies in the prior art, replacing the low accuracy, slow speed and lack of parallel computation of existing quantitative investment technology with high accuracy, fast speed and support for parallel computation.
The invention adopts the following technical scheme:
a LightGBM-based quantitative investment calculation method comprises the following steps:
s1, obtaining stock historical data and constructing a factor pool;
s2, carrying out missing value processing and abnormal value processing on the factor data in the step S1, and then unifying dimensions among different indexes through data standardization to finish data preprocessing;
s3, adjusting parameters by using a LightGBM algorithm;
S4, carrying out investment calculation according to the rate-of-return index and the risk-measurement index.
Specifically, in step S1, the factor pool includes financial index factors and technical index factors, which are listed in the following table:
[Table image in original (financial and technical index factors); the contents are not reproduced in the text.]
Further, in step S1, the factor pool also includes other factors, which are listed in the following table:
[Table image in original (other factors); the contents are not reproduced in the text.]
specifically, in step S2, if the missing value exceeds 50%, the corresponding stock is discarded; and when the missing value is less than 50%, completing by adopting an interpolation method.
Specifically, the abnormal value processing in step S2 is specifically: data outside three quarters and one quarter of the bit lines of the entire data set is discarded.
Specifically, in step S2, the data normalization specifically includes: and scaling each index to be between 0 and 1 according to a data standardization mode.
Specifically, in step S3, if the machine is trained within 24 hours, a grid search method is used to perform parameter adjustment; otherwise, adjusting parameters according to the learning rate, the iteration times, the maximum depth and the regular term in sequence.
Further, the value of the learning rate is selected to be 0.1; for a given learning rate and the number of decision trees, carrying out decision tree specific parameter tuning; then, adjusting and optimizing the regularization parameter of the xgboost; finally, the learning rate is reduced, and ideal parameters are determined; a grid search is used to achieve global optimality.
Another aspect of the invention is a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods described.
Another aspect of the present invention is a computing device, including:
one or more processors, memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods.
Compared with the prior art, the invention has at least the following beneficial effects:
The LightGBM-based quantitative investment calculation method provided by the invention applies a machine learning model with stronger performance and combines it with the selection of financial index factors and with feature engineering, achieving higher accuracy and higher speed and supporting parallel computation.
Furthermore, financial index factors and technical index factors commonly used in the stock market are selected, and relevant knowledge from the financial domain supports the machine learning model, making the model more interpretable.
Furthermore, other factors can be set to supplement the model's domain knowledge, making the model more robust and responsive to market change.
Furthermore, missing values are treated as part of the data: if less than 50% of a series is missing, the data still carry value, and the gaps can be filled by interpolation, giving a larger training sample and stronger model generalization.
Furthermore, abnormal values are also part of the data but are harmful to model building; some learners are sensitive to them, and abnormal values can bias the model severely, so they are removed directly.
Furthermore, data standardization is a common feature-engineering step: raw data often carry different dimensions across independent variables, which complicates analysis, and with large magnitudes rounding errors may degrade results. Standardization eliminates the influence of differing dimensions and orders of magnitude and avoids unnecessary errors.
Further, step S3 uses the LightGBM algorithm, which discretizes continuous samples into a histogram; the histogram accumulates the required statistics, and the optimal split points are then found by traversing the discrete values of the histogram. The speed is roughly 8 times that of conventional XGBoost.
In summary, the present invention frees the user from traditional manual analysis and qualitative investment (that is, deciding to buy or sell a stock by analyzing its fundamentals according to the investor's personal feeling and experience), and overcomes the slow speed, low accuracy and lack of parallel computation in the prior art.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
LightGBM is implemented on top of the boosting-tree algorithm. Boosting trees realize the learning optimization process with an additive model and a forward stagewise algorithm; XGBoost and GBDT are examples. GBDT uses the negative gradient as the splitting indicator (information gain), while XGBoost also uses the second derivative. Their common disadvantage is that computing the information gain requires scanning all samples to find the optimal split point, so their efficiency and scalability are hard to sustain with large data volumes or high feature dimensions. The straightforward remedy is to reduce the number of features and samples without affecting accuracy; some work accelerates the boosting process by sampling according to data weights, but this cannot be applied to GBDT, which has no sample weights.
LightGBM solves these problems well with two main algorithms: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
GOSS (reducing the number of samples): most small-gradient samples are excluded, and only the remaining samples are used to compute the information gain. GBDT has no data weights, but each data instance has its own gradient, and by the definition of information gain, instances with large gradients contribute more to it. Therefore, when down-sampling, samples with large gradients (above a preset threshold, or within the top percentiles) should be retained as far as possible, while samples with small gradients are removed at random. LightGBM proves that, at the same sampling rate, this measure achieves more accurate results than random sampling, especially when the information gain has a large range.
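As an illustration, the GOSS procedure described above can be sketched in a few lines of NumPy. This is a minimal sketch following the published GOSS description, not code from the patent; the ratios a and b and all names are illustrative:

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Keep the top-a fraction of instances by |gradient|, randomly sample a
    b fraction of the rest, and up-weight the sampled small-gradient
    instances by (1 - a) / b so the information-gain estimate stays unbiased."""
    rng = rng or np.random.default_rng(0)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))     # large-gradient instances first
    top_k = int(a * n)
    sampled = rng.choice(order[top_k:], size=int(b * n), replace=False)
    idx = np.concatenate([order[:top_k], sampled])
    weights = np.ones(len(idx))
    weights[top_k:] = (1 - a) / b              # compensate the down-sampling
    return idx, weights

idx, w = goss_sample(np.random.default_rng(1).normal(size=1000))
print(len(idx), w.max())  # 300 instances kept; small-gradient weight 8.0
```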
EFB (reducing the number of features): bundled mutually exclusive features are replaced with a composite feature. In typical applications the number of features is large, but because the feature space is very sparse, a nearly lossless method can be designed to reduce the number of effective features. In a sparse feature space many features are almost mutually exclusive (they are rarely nonzero at the same time, as with one-hot encodings), so such features can be bundled. The bundling problem reduces to the graph coloring problem, and an approximate solution is obtained with a greedy algorithm.
The invention provides a LightGBM-based quantitative investment calculation method that applies a machine learning algorithm to the traditional multi-factor model to construct a multi-factor stock-selection model based on the LightGBM algorithm. The essence of the model is to consider the relationship between different types of factors and stock profitability, build a classification model of stock rises and falls, and screen out a class of stocks with investment value to construct a portfolio that earns excess returns.
Referring to fig. 1, a method for quantitative investment calculation based on LightGBM according to the present invention includes the following steps:
s1, constructing a factor pool by an investor;
the stock market reflects the good and bad of economic conditions to a certain extent, and although the objective law of the stock market in China A is not obvious, the fact that the four factors of market, basic, policy and people can influence the fluctuation of the stock market is real. Price is also true in the securities market where the objective rationale is determined by both demand and supply factors, as is the degree of match between the flow of funds and the quantity of securities. The intrinsic value of a commodity determines the price of the commodity, and a stock on a stock market is also a commodity, and the price of the stock is determined by the operation performance of a company represented by the stock to a large extent, and the operation performance of the company is not only related to the management of the company, but also related to the industry and the whole economic environment, so that the stock price is a basic factor influencing the stock price. Because government laws and regulations, prohibition and support of industry and the like can also influence the business operation performance, government policies can also influence the change of stock price.
For the factors which have great influence on the stock price in the aspects, the factors can be represented as much as possible under the current conditions, and the factors which can be used in the model are as follows:
TABLE 1 financial and technical index factors
[Table image in original; the factor names are not reproduced in the text.]
TABLE 2 other factors (macroscopical, bond, building, etc.)
[Table image in original; the factor names are not reproduced in the text.]
In the traditional multi-factor stock-selection method, investment generally emphasizes factor validity testing: a stock portfolio is selected by ranking on a single factor's value, the procedure is repeated over several periods, and the portfolio's returns are computed. According to the returns of the portfolio selected by each factor, factors that generate persistent returns are screened out as effective factors for multi-factor stock selection. Factor validity testing is limited mainly by the stock classification method and by factor incompleteness, and novel classification algorithms cannot bear the computation brought by high dimensionality, which reduces model efficiency. The LightGBM-based quantitative investment method needs no factor validity testing, because the constructed factor pool contains microscopic factors of individual stocks (finance, dividend, momentum, scale, industry and so on), macroscopic factors of the whole stock market and index factors such as dividend confidence, plus more than three hundred factors including national macroeconomic factors, real-estate-market prosperity factors, bond yields and benchmark deposit interest rates; with such a huge number of factors, validity testing of each one is infeasible.
In addition, the classification algorithm is the boosting algorithm LightGBM. Its key mechanism here is random selection of factor features: each tree is built not from all features but from a randomly selected subset, and all trees are finally integrated to obtain the classification result. The invention therefore lets investors construct models directly, without factor screening or factor testing.
The investor selects appropriate factors as the factor pool according to his knowledge and understanding of the market.
S2, preprocessing data;
the stock pool is selected from all Shanghai depths 300 which are listed before 1 month in 2010 and have not been subjected to stop for more than one month and are normally listed as the stock pool, and is characterized in that the factors are shown in the table, the data time region is from 1 month in 2010 to 12 months and 30 days in 2019, and a data source investor can automatically call a database comprising a Runshi financial economy database, a WIND financial information database and the like.
Data obtained at the initial stage of actual data mining and quantitative investment construction are dirty data, namely the data always have the problems of noise, missing values, non-uniform data dimension and the like to a certain extent, and the existence of the dirty data directly influences the later modeling effect of the user. Therefore, before modeling, data preprocessing is firstly performed, irregular data is processed into regular data, missing values are filled, and dimensions are unified. Data preprocessing generally consists of data cleansing, data integration, data specification, and data transformation.
S201, missing value processing
The factor data come from the financial and quotation data of the CSI 300 stocks and from contemporaneous macroeconomic, real-estate and bond data, covering a wide range of industries, so some factor values may be missing. If more than 50% of the values are missing, the corresponding stock is discarded; when less than 50% are missing, the gaps are completed by interpolation. The interpolation uses the SMOTE algorithm.
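A minimal pandas sketch of this step follows. It drops stocks whose factor series are more than half missing and fills the remaining gaps by simple linear interpolation; the patent names SMOTE for completion, and plain interpolation is shown here only as an illustrative stand-in. The column layout (one stock per column) is an assumption:

```python
import numpy as np
import pandas as pd

def handle_missing(factors: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Drop columns with more than `threshold` missing values, then fill
    the remaining gaps by linear interpolation (step S201)."""
    missing_ratio = factors.isna().mean()
    kept = factors.loc[:, missing_ratio <= threshold]
    return kept.interpolate(method="linear", limit_direction="both")

# Toy example: one factor series per stock, indexed by trading day.
df = pd.DataFrame({"stock_a": [1.0, np.nan, 3.0, 4.0],
                   "stock_b": [np.nan, np.nan, np.nan, 2.0]})
print(handle_missing(df))  # stock_b (75% missing) is discarded
```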
S202, abnormal value processing
An outlier is defined as a data point lying outside the first and third quartiles of the whole data set. Outliers cannot simply be deleted in one step: when abnormal data account for more than 10% of the total, deleting them may alter the structure of the whole data set and thus affect the subsequent modeling results. In most cases the treatment combines the specific problem with specific analysis.
A large data set is used here; the abnormal values are part of it but are harmful to model building, and some learners are sensitive to them, so abnormal values can bias the model severely. They are therefore removed directly.
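A sketch of this quartile rule, applied to one factor series at a time, follows. The strict Q1..Q3 band of the text is used; note that a common practical variant widens the band to Q1 - 1.5*IQR .. Q3 + 1.5*IQR:

```python
import pandas as pd

def drop_outliers(series: pd.Series) -> pd.Series:
    """Discard points lying outside the first/third quartile band (step S202)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return series[(series >= q1) & (series <= q3)]

clean = drop_outliers(pd.Series([1, 2, 2, 3, 3, 3, 4, 100]))
print(clean.tolist())  # only values inside [Q1, Q3] survive: [2, 2, 3, 3, 3]
```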
S203, data standardization
Standardization is used to unify the dimensions of different indexes. Data with different dimensions can differ greatly: a student's height is typically distributed between 120 and 200, while visual acuity may lie between 0.2 and 1, so modeling the two together lets the large difference between them directly dominate the result. To eliminate the dimensional differences between indexes, the data are standardized and every index is scaled proportionally to between 0 and 1.
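A minimal sketch of this proportional scaling with scikit-learn's MinMaxScaler follows; the column names are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_factors(factors: pd.DataFrame) -> pd.DataFrame:
    """Scale every index proportionally into [0, 1] (step S203)."""
    scaler = MinMaxScaler(feature_range=(0, 1))
    return pd.DataFrame(scaler.fit_transform(factors),
                        index=factors.index, columns=factors.columns)

df = pd.DataFrame({"height": [120, 160, 200], "vision": [0.2, 0.6, 1.0]})
print(scale_factors(df))  # both columns now span exactly 0..1
```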
S3, parameter tuning with the LightGBM algorithm
Step 1: determine the learning rate and the number of iterations.
Step 2: determine max_depth and num_leaves.
Step 3: determine min_data_in_leaf and max_bin.
Step 4: determine feature_fraction and bagging_freq.
Step 5: determine lambda_l1 and lambda_l2.
Step 6: determine min_split_gain.
Step 7: reduce the learning rate, increase the number of iterations, and validate the model. One stage of this staged search is sketched below.
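One stage of the schedule can be run with scikit-learn's GridSearchCV over an LGBMClassifier, as in the hedged sketch below; the synthetic factor matrix, the labels and the exact grids are placeholders rather than values fixed by the patent:

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the preprocessed factor matrix and
# the next-month rise/fall labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 2, size=500)

# Step 2 of the schedule: learning rate and iteration count held fixed,
# tree-shape parameters searched.
grid = GridSearchCV(
    LGBMClassifier(objective="binary", learning_rate=0.1, n_estimators=100),
    param_grid={"max_depth": [3, 5, 7], "num_leaves": [15, 20, 31]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```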
LightGBM core parameters:
config or config_file: a string giving the path of the configuration file. Default: empty string.
task: a string giving the task to perform. Possible values:
'train' or 'training': a training task. Default: 'train'.
'predict' or 'test': a prediction task.
'convert_model': a model conversion task; the model file is converted to if-else format.
application or objective or app: a string indicating the problem type. Possible values:
'regression', 'regression_l2', 'mean_squared_error', 'mse', 'l2_root', 'root_mean_squared_error' or 'rmse': a regression task with the L2 loss function. Default: 'regression'.
'regression_l1', 'mae' or 'mean_absolute_error': a regression task with the L1 loss function.
'huber': a regression task with the Huber loss function.
'poisson': a Poisson regression task.
'quantile': a quantile regression task.
'quantile_l2': a quantile regression task with the L2 loss function.
'mape' or 'mean_absolute_percentage_error': a regression task with the MAPE loss function.
'gamma': a gamma regression task.
'tweedie': a Tweedie regression task.
'binary': a binary classification task, with the logarithmic loss as the objective function.
'multiclass': a multi-classification task, with the softmax function as the objective; the num_class parameter must be set.
'multiclassova', 'multiclass_ova' or 'ovr': a multi-classification task with a one-vs-all binary objective; the num_class parameter must be set.
'xentropy' or 'cross_entropy': cross entropy as the objective function (with optional linear weights); labels must be numbers in [0,1].
'xentlambda' or 'cross_entropy_lambda': an alternative parameterization of cross entropy; labels must be numbers in [0,1].
'lambdarank': a ranking task. In the lambdarank task, labels should be integers, with larger numbers indicating higher relevance; the label_gain parameter can set the gain (weight) of an integer label.
boosting or 'boost' or 'boosting_type': a string giving the base-learner algorithm. Possible values:
'gbdt': the traditional gradient boosting decision tree; the default value is 'gbdt'.
'rf': random forest.
'dart': gbdt with dropout.
'goss': gbdt with Gradient-based One-Side Sampling.
data or train_data: a string giving the file name of the training data; default: empty string. LightGBM uses it to train the model.
valid or test or valid_data or test_data: a string giving the file name of the validation set; default: empty string. LightGBM outputs metrics for this data set; if there are multiple validation sets, they are separated by commas.
num_iterations (or num_iteration, num_tree, num_trees, num_round, num_rounds, num_boost_round): an integer giving the number of boosting iterations; default 100.
For the Python/R packages this parameter is ignored; in Python, the num_boost_round argument of train()/cv() is used instead.
Internally, LightGBM builds num_class * num_iterations trees for multi-class problems.
learning_rate or shrinkage_rate: a floating-point number giving the learning rate; default 0.1. In dart it also affects the normalized weights of dropped trees.
num_leaves or num_leaf: an integer giving the number of leaves per tree; default 31.
tree_learner or tree: a string giving the tree learner, mainly used for parallel learning; default 'serial'. Possible values:
'serial': the single-machine tree learner.
'feature': the feature-parallel tree learner.
'data': the data-parallel tree learner.
'voting': the voting-parallel tree learner.
num_threads or num_thread or nthread: an integer giving LightGBM's number of threads; default: the OpenMP default.
For best speed, set it to the number of physical CPU cores, not the number of threads (most CPUs use hyper-threading to give each core two threads).
For parallel learning, not all CPU cores should be used, since that degrades network performance.
device: a string specifying the computing device; default 'cpu'. May be 'gpu' or 'cpu'.
On GPU it is suggested to use a smaller max_bin, below 10, for faster computation.
To speed up learning, the GPU by default uses 32-bit floating-point numbers for summation; setting gpu_use_dp to True enables 64-bit floating-point numbers but reduces training speed.
The LightGBM core parameters are selected as follows:
config: default value
task: 'train'
application: 'binary'
boosting: 'goss'
num_iterations: 100
learning_rate: 0.1
num_leaves: 20
tree_learner: 'serial'
num_threads: 4
device: 'cpu'
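A minimal training sketch using exactly these core parameters follows; the data are synthetic placeholders, and num_iterations is passed as num_boost_round, per the Python-package convention noted above:

```python
import lightgbm as lgb
import numpy as np

params = {
    "task": "train",
    "objective": "binary",
    "boosting": "goss",
    "learning_rate": 0.1,
    "num_leaves": 20,
    "tree_learner": "serial",
    "num_threads": 4,
    "device": "cpu",
}

# Synthetic stand-ins for the preprocessed factor matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))
y = rng.integers(0, 2, size=1000)

booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
print(booster.num_trees())  # 100 boosted trees
```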
LightGBM settings for better accuracy:
use a max_bin greater than 10;
use a learning_rate less than 1 and num_iterations greater than 100;
use num_leaves greater than 3;
use more than 1000 training samples;
try dart.
LightGBM parameter tuning for leaf-wise trees:
num_leaves: controls the number of leaf nodes and is the main parameter controlling tree-model complexity.
For a level-wise tree the equivalent is 2^depth leaves, where depth is the depth of the tree. However, at an equal number of leaves a leaf-wise tree grows far deeper than a level-wise tree, which very likely causes overfitting, so num_leaves should be set smaller than 2^depth. A leaf-wise tree has no real notion of depth, since there is no reasonable mapping from number of leaves to depth.
min_data_in_leaf: the minimum number of samples per leaf node (at least 1); an important parameter for controlling overfitting of leaf-wise trees. Setting it to a larger value, e.g. 5, avoids growing an overly deep tree, but may also cause underfitting.
max_depth: controls the maximum depth of the tree; this parameter can explicitly limit tree depth.
LightGBM settings for faster training speed:
use bagging, by setting the bagging_fraction and bagging_freq parameters;
use feature sub-sampling, by setting the feature_fraction parameter;
use a smaller max_bin, say 10;
use save_binary to speed up data loading in future training runs.
Investors can also tune the parameters themselves; the tuning process is as follows:
First select a relatively high learning rate; typically the learning rate is 0.1. Second, for the given learning rate and number of decision trees, tune the decision-tree-specific parameters. Next, tune the regularization parameters (as for XGBoost). Finally, lower the learning rate and determine the ideal parameters. A grid search can be used to reach the global optimum.
Tuning the decision-tree-specific parameters: an initial depth of 5 and 200 trees are selected;
tuning the regularization parameters: there are two options, L1 and L2; here L2 is used;
reaching the global optimum with grid search: the grid search can be performed with scikit-learn's GridSearchCV package.
S4, determining the evaluation indexes and the quantitative investment method.
Two major classes of indexes, rate-of-return indexes and risk-measurement indexes, are selected to evaluate performance.
The rate-of-return indexes include the total return, the annualized compound return and the relative return. The risk-measurement indexes mainly comprise the beta coefficient, the Sharpe ratio, the maximum drawdown, the information ratio and so on. The method mainly uses three indexes: the Sharpe ratio, the maximum drawdown and the information ratio.
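A sketch of the three main indexes under their usual definitions follows, computed from a daily simple-return series and assuming 252 trading days per year; the return and benchmark series are random placeholders:

```python
import numpy as np

def sharpe_ratio(returns, risk_free=0.0, periods=252):
    """Annualized Sharpe ratio of a daily simple-return series."""
    excess = returns - risk_free / periods
    return np.sqrt(periods) * excess.mean() / excess.std(ddof=1)

def max_drawdown(returns):
    """Largest peak-to-trough loss of the cumulative wealth curve (negative)."""
    wealth = np.cumprod(1 + returns)
    peak = np.maximum.accumulate(wealth)
    return ((wealth - peak) / peak).min()

def information_ratio(returns, benchmark, periods=252):
    """Annualized mean active return divided by the tracking error."""
    active = returns - benchmark
    return np.sqrt(periods) * active.mean() / active.std(ddof=1)

rng = np.random.default_rng(0)
r = rng.normal(0.0005, 0.01, 750)   # placeholder strategy returns
b = rng.normal(0.0003, 0.01, 750)   # placeholder benchmark returns
print(sharpe_ratio(r), max_drawdown(r), information_ratio(r, b))
```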
Summarizing these indexes, the stocks with high return indexes and low risk are selected for investment, completing the LightGBM-based quantitative investment.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
CSI 300 stocks, which are representative in the financial field, are selected. Of the 300 constituents, those delisted or suspended between January 2010 and December 2019 are removed, and the remaining tradable stocks form the stock pool. Several kinds of factors, such as value factors, financial quality factors, growth factors, technical factors and momentum factors, form the factor pool, and training samples are constructed by pairing the factor data of the last trading day of each month with the stock's return over the following month. After feature engineering, the current month's factor data predict the probability that the stock rises or falls in the next month. The data from 2017 to 2019 serve as the back-test, that is, 7 years of data are used for training and 3 years for the back-test, 10 years of data in total. Stock-price trends are predicted by classification.
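A sketch of the sample construction just described follows: a month-end factor snapshot is paired with the sign of the following month's return. The price frame is synthetic and the tickers are placeholders:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2010-01-04", "2019-12-30", freq="B")
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, (len(dates), 3)), axis=0)),
    index=dates, columns=["stock_a", "stock_b", "stock_c"],
)

month_end = prices.resample("M").last()      # last trading day of each month
next_ret = month_end.pct_change().shift(-1)  # the following month's return
labels = (next_ret > 0).astype(int)          # 1 = rise, 0 = fall
print(labels.head())
```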
Stock-selection performance is evaluated with indexes such as the total return, the annualized return, the Sharpe ratio and the maximum drawdown.
1. High speed
In terms of speed, before LightGBM was proposed, the most common boosting algorithm in mainstream use was the XGBoost algorithm, a decision-tree algorithm based on pre-sorting. Its basic idea for constructing a decision tree is as follows. First, all features are pre-sorted by their numeric values. Second, when traversing split points, the best split point on a feature is found at a cost of O(#data). Finally, after the best split point of a feature is found, the data are split into left and right child nodes.
Such a pre-sorting algorithm has the advantage of finding split points exactly, but its disadvantages are also evident. First, space consumption is large: the algorithm must store the feature values and also the feature-sorting results (for example, the sorted indexes, so that split points can be computed quickly later), consuming twice the memory of the training data. Second, the time overhead is large: the split gain must be computed every time a split point is visited, which is costly. Finally, it is unfriendly to cache optimization: after pre-sorting, access to the gradients is random access with a different access order per feature, so the cache cannot be used well; meanwhile, growing each level of the tree requires random access to an array mapping rows to leaf indexes, again with a different order per feature, which also causes heavy cache misses.
LightGBM instead uses a histogram algorithm: continuous floating-point feature values are discretized into k integers and a histogram of width k is constructed. While the data are traversed, statistics are accumulated in the histogram using the discretized values as indexes; after one pass over the data the histogram holds the required statistics, and the optimal split point is then found by traversing the discrete values of the histogram. LightGBM's running speed is higher than XGBoost's.
The histogram method only needs to compute the information gain over the histogram statistics, a much smaller computation than the pre-sorting algorithm's traversal of all values on every pass. Its memory footprint is also much smaller than the pre-sorting algorithm's: pre-sorting must store every feature's sorting structure, requiring about 2 × #data × #feature × 4 bytes, whereas the histogram algorithm stores only the discrete bin values, not the original feature values, requiring about #data × #feature × 1 byte, since a bin value fits in a uint8_t.
When the feature histograms of the child nodes are needed, only one child's histogram must be constructed; the other child's histogram is obtained by subtracting the newly built child's histogram from the parent's, compressing the time complexity to O(k), where k is the number of histogram bins. With this method, after constructing one leaf's histogram LightGBM obtains the histogram of its sibling leaf at very small cost, doubling the speed.
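The histogram trick in miniature, as a hedged sketch: one feature is discretized into k bins, the gradient statistics are accumulated in a single pass, and a sibling's histogram is recovered by subtraction instead of a second pass over the data. All values here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 16
feature = rng.normal(size=1000)
grad = rng.normal(size=1000)

# Discretize the continuous feature into k integer bins (quantile edges).
edges = np.quantile(feature, np.linspace(0, 1, k + 1)[1:-1])
bins = np.digitize(feature, edges)

# One pass over the data accumulates the statistics the split search needs.
hist_grad = np.bincount(bins, weights=grad, minlength=k)
hist_count = np.bincount(bins, minlength=k)

# Histogram subtraction: build one child, derive the sibling in O(k).
left = bins < k // 2                    # stand-in for the samples sent left
left_grad = np.bincount(bins[left], weights=grad[left], minlength=k)
right_grad = hist_grad - left_grad      # sibling histogram, no second pass
print(hist_count.sum(), np.allclose(right_grad[:k // 2], 0))  # 1000 True
```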
Three well-known data sets of different orders of magnitude are used to compare speed: the Iris data set of only 234 KB, the MNIST data set of 121 MB and the Higgs boson data set of 7.48 GB.
Hardware
CPU: AMD Ryzen 7 1700, 8 cores 16 threads
GPU: Nvidia GTX 1060 6 GB
Memory: 8 GB
Software
System: Windows 10
OpenCL: CUDA 8
Boost Binary: msvc-14.1-64
Computing software: R 3.3.3 + Rtools 3.4
Computing packages: LightGBM 0.2 + XGBoost 0.6-4
GBM algorithm parameter settings:
nrounds = 100
learning rate / eta = 1
early_stopping_rounds = 10
Data sets: Iris (3 classes) / MNIST (10 classes) / Higgs boson (2 classes)
The evaluation results were as follows:
[Table image in original: timing comparison results; the values are not reproduced in the text.]
The speedup from using LightGBM on CPU for small and medium data sets, and LightGBM on GPU for large data sets, is significant.
2. High accuracy
Compared with traditional algorithms such as Bayes and SVM, the LightGBM algorithm is a strong learner and has higher accuracy.
3. Supporting parallel computing
LightGBM natively supports parallel learning. It currently supports feature parallelism (Feature Parallel) and data parallelism (Data Parallel), plus a third mode, voting-based data parallelism (Voting Parallel).
The main idea of feature parallelism is to find the optimal split points on different machines over different feature sets and then synchronize the optimal split point among the machines. In data parallelism, different machines build histograms locally, the histograms are then merged globally, and the optimal split point is found on the merged histogram.
LightGBM optimizes both of these parallel methods:
In the feature-parallel algorithm:
a. each worker finds an optimal split point {feature, threshold} on its local feature set;
b. the local split points are communicated and integrated to obtain the overall optimal split;
c. the optimal split is performed.
In data parallelism, Reduce-Scatter is used to distribute the histogram-merging task across different machines, reducing communication and computation, and histogram subtraction further halves the communication volume. Voting-based data parallelism (Voting Parallel) further optimizes the communication cost of data parallelism down to a constant level; when the data volume is large, voting parallelism achieves a very good speedup.
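Selecting one of these modes is a configuration change on the tree_learner parameter; a hedged sketch follows. num_machines, local_listen_port and machine_list_file are the distributed-setup parameters named in LightGBM's documentation and would need real values for an actual cluster:

```python
# Single-machine runs use tree_learner='serial' (the default). For a
# cluster, choose the parallel mode discussed above:
parallel_params = {
    "objective": "binary",
    "tree_learner": "voting",        # 'feature', 'data', or 'voting'
    "num_machines": 2,               # cluster size (placeholder)
    "local_listen_port": 12400,      # socket setup (placeholder)
    # "machine_list_file": "mlist.txt",  # host:port list for the cluster
}
```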
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. A LightGBM-based quantitative investment calculation method is characterized by comprising the following steps:
s1, obtaining stock historical data and constructing a factor pool;
s2, carrying out missing value processing and abnormal value processing on the factor data in the step S1, and then unifying dimensions among different indexes through data standardization to finish data preprocessing;
s3, adjusting parameters by using a LightGBM algorithm;
S4, carrying out investment calculation according to the rate-of-return index and the risk-measurement index.
2. The LightGBM-based quantitative investment calculation method of claim 1, wherein in step S1 the factor pool comprises financial index factors, technical index factors and other factors, the financial index factors and technical index factors being specifically:
[Table image in original; the factor names are not reproduced in the text.]
3. The LightGBM-based quantitative investment calculation method of claim 1 or 2, wherein in step S1 the factor pool further comprises other factors, the other factors being specifically:
[Table image in original; the factor names are not reproduced in the text.]
4. The LightGBM-based quantitative investment calculation method of claim 1, wherein in step S2, if more than 50% of the values are missing, the corresponding stock is discarded, and when less than 50% are missing, the gaps are completed by interpolation.
5. The LightGBM-based quantitative investment calculation method of claim 1, wherein the abnormal value processing of step S2 is specifically: data lying outside the first and third quartiles of the whole data set are discarded.
6. The LightGBM-based quantitative investment calculation method of claim 1, wherein in step S2 the data standardization is specifically: each index is scaled proportionally to between 0 and 1.
7. The LightGBM-based quantitative investment calculation method of claim 1, wherein in step S3, if the machine can complete training within 24 hours, the parameters are tuned by grid search; otherwise, the parameters are adjusted in the order of learning rate, number of iterations, maximum depth and regularization term.
8. The LightGBM-based quantitative investment calculation method of claim 7, wherein the learning rate is set to 0.1; for the given learning rate and number of decision trees, the decision-tree-specific parameters are tuned; the XGBoost-style regularization parameters are then tuned; finally, the learning rate is reduced and the ideal parameters are determined; a grid search is used to reach the global optimum.
9. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-8.
10. A computing device, comprising:
one or more processors, memory, and one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 1-8.
CN202010734157.2A 2020-07-27 2020-07-27 LightGBM-based quantitative investment calculation method, storage medium and equipment Pending CN111898901A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010734157.2A CN111898901A (en) 2020-07-27 2020-07-27 LightGBM-based quantitative investment calculation method, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010734157.2A CN111898901A (en) 2020-07-27 2020-07-27 LightGBM-based quantitative investment calculation method, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN111898901A (en) 2020-11-06

Family

ID=73189272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010734157.2A Pending CN111898901A (en) 2020-07-27 2020-07-27 LightGBM-based quantitative investment calculation method, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN111898901A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112561569A (en) * 2020-12-07 2021-03-26 上海明略人工智能(集团)有限公司 Dual-model-based arrival prediction method and system, electronic device and storage medium
CN112561569B (en) * 2020-12-07 2024-02-27 上海明略人工智能(集团)有限公司 Dual-model-based store arrival prediction method, system, electronic equipment and storage medium
CN114663722A (en) * 2022-03-08 2022-06-24 上海应用技术大学 LightGBM algorithm-based food-borne pathogenic bacteria classification method

Similar Documents

Publication Publication Date Title
Zhu et al. [Retracted] Early Warning of Financial Risk Based on K‐Means Clustering Algorithm
CN110400021B (en) Bank branch cash usage prediction method and device
US8156030B2 (en) Diversification measurement and analysis system
Song et al. The impact of financial enterprises’ excessive financialization risk assessment for risk control based on data mining and machine learning
CN112348654A (en) Automatic assessment method, system and readable storage medium for enterprise credit line
US20120116994A1 (en) Dynamic Portfolio Monitoring
CN111291925A (en) Financial market prediction and decision-making system and method based on artificial intelligence
CN112862182A (en) Investment prediction method and device, electronic equipment and storage medium
CN111898901A (en) LightGBM-based quantitative investment calculation method, storage medium and equipment
JP2018503927A (en) Segmentation and stratification of composite portfolio of investment securities
CN112364182A (en) Graph feature-based enterprise risk conduction prediction method and device and storage medium
Balakayeva et al. The solution to the problem of processing Big Data using the example of assessing the solvency of borrowers
Yang et al. The extraction of early warning features for predicting financial distress based on XGBoost model and shap framework
Song et al. Incorporating research reports and market sentiment for stock excess return prediction: A case of mainland China
CN113421014A (en) Target enterprise determination method, device, equipment and storage medium
Gerdrup et al. On the purpose of models-the Norges Bank experience
CN115860924A (en) Supply chain financial credit risk early warning method and related equipment
CN114168635A (en) Trading strategy mining method, system, equipment and storage medium for securities portfolio investment
Zhang et al. Herd effect analysis of stock market based on big data intelligent algorithm
Widnyana et al. Effect of Company Size, Profitability and Capital Structure on Firm Value in Indonesia
Wang et al. The Impact of FinTech on the Profitability of Traditional banks.
Wang et al. Stock selection strategy of A-share market based on rotation effect and random forest
CN113177733A (en) Medium and small micro-enterprise data modeling method and system based on convolutional neural network
Han Research on influencing factors of stock returns based on multiple regression and artificial intelligence model
Rong [Retracted] Dynamic Cause Analysis of Quantitative Investment Using Grey Correlation Analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination