CN113285845B - Data transmission method, system and equipment based on improved CART decision tree - Google Patents

Data transmission method, system and equipment based on improved CART decision tree

Info

Publication number
CN113285845B
CN113285845B (application CN202110834148.5A)
Authority
CN
China
Prior art keywords
sample data
sample
data set
preset
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110834148.5A
Other languages
Chinese (zh)
Other versions
CN113285845A (en)
Inventor
苑志超
朱剑飞
刘奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Primate Intelligent Technology Hangzhou Co ltd
Original Assignee
Primate Intelligent Technology Hangzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Primate Intelligent Technology Hangzhou Co ltd
Priority to CN202110834148.5A
Publication of CN113285845A
Application granted
Publication of CN113285845B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/142 Network analysis or design using statistical or mathematical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Algebra (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application relates to a data transmission method, system and equipment based on an improved CART decision tree. The method comprises: obtaining a stopping condition by transforming a preset threshold; reading and counting a preset sample data set according to each sample feature and storing the plurality of statistical results; performing a simplified calculation on the statistical results to obtain the Gini index gain of each sample feature; judging, from the Gini index gain and the stopping condition, whether the division condition is satisfied; if not, stopping the calculation on the preset sample data set; if so, dividing the preset sample data set into two sub-data sets, setting each sub-data set as a preset sample data set, and continuing the calculation until the CART decision tree is generated.

Description

Data transmission method, system and equipment based on improved CART decision tree
Technical Field
The present application relates to the field of communication network transmission, and in particular, to a data transmission method, system and device based on an improved CART decision tree.
Background
In a world where everything is interconnected, enormous data processing demands place higher requirements on transmission quality and security. In conventional Internet data transmission over the TCP/IP protocol, after TCP sends a data packet it starts a retransmission timer and places the packet in a retransmission queue. If an acknowledgement for the packet is received, the timer is cancelled; if no acknowledgement has been received when the timer expires, the packet is considered lost, taken out of the retransmission queue and sent again (starting the timer in the same way). However, the high delay, packet loss and congestion present in this timer-based retransmission scheme reduce the quality and efficiency of data transmission, and frequent packet loss caused by various network problems has become a major factor degrading the experience of clients of applications such as live audio and video streaming and games.
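For orientation, the following is a minimal sketch of the timer-based retransmission scheme described above; the class name, the fixed 200 ms timeout and the callback signature are illustrative assumptions, not details of this application:

```python
import time

RTO = 0.2  # assumed fixed retransmission timeout (seconds); real TCP adapts this

class RetransmitQueue:
    """Toy model of TCP-style retransmission: sending starts a timer, an ACK
    cancels it, and an expired timer triggers a resend."""

    def __init__(self, transmit):
        self.transmit = transmit   # callback that actually sends a packet
        self.pending = {}          # seq -> (packet, timer deadline)

    def send(self, seq, packet):
        self.transmit(seq, packet)
        self.pending[seq] = (packet, time.monotonic() + RTO)  # start the timer

    def ack(self, seq):
        self.pending.pop(seq, None)  # acknowledgement received: cancel the timer

    def poll(self):
        # packets whose timer has expired are considered lost and sent again
        now = time.monotonic()
        for seq, (packet, deadline) in list(self.pending.items()):
            if now >= deadline:
                self.send(seq, packet)  # resend and restart the timer
```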
At present, the related art offers no effective solution to the problems of high delay and packet loss in network data transmission based on the TCP/IP protocol.
Disclosure of Invention
The embodiment of the application provides a data transmission method, a system and equipment based on an improved CART decision tree, so as to at least solve the problems of high delay and packet loss in network data transmission based on a TCP/IP protocol in the related art.
In a first aspect, an embodiment of the present application provides a data transmission method based on an improved CART decision tree, where the method includes:
acquiring a preset sample data set for training a decision tree, wherein the preset sample data set comprises a plurality of sample features;
repeatedly executing preset steps, wherein the preset steps comprise:
converting a preset threshold according to the preset sample data set to obtain a stopping condition;
reading and counting the preset sample data set in parallel according to each sample feature in the preset sample data set, and storing a plurality of statistical results;
performing, according to the statistical results, a simplified calculation with the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample features to obtain the Gini index gains of the sample features;
selecting the division Gini index gain from the Gini index gains of the plurality of sample features;
judging whether the division condition is satisfied according to the division Gini index gain and the stopping condition;
if not, stopping the calculation on the preset sample data set; the CART decision tree is generated once calculation has stopped on all preset sample data sets;
if so, dividing the preset sample data set into a first sub-sample data set and a second sub-sample data set according to the optimal feature and the optimal split point, and setting the first sub-sample data set and the second sub-sample data set each as a preset sample data set.
In some embodiments, performing the simplified calculation with the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample feature to obtain the Gini index gain comprises:

calculating, from the preset Gini(D) and the Gini(D, A) of the sample feature, the Gini index gain of the sample feature as Gain = Gini(D) - Gini(D, A), namely

$$\mathrm{Gain}=\frac{2S(N-S)}{N^{2}}-\frac{2}{N}\left(\frac{S_{l}(N_{l}-S_{l})}{N_{l}}+\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

which is simplified according to the formula to

$$\mathrm{Gain}=\frac{2}{N}\left(\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}-\frac{S^{2}}{N}\right)$$

and, since N and S are constants before each division, is further simplified to the Gini index gain

$$\mathrm{Gain}'=\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}$$

where N is the number of sample data in the preset sample data set, S is the number of positively labelled sample data in N, N_l is the number of sample data to the left of the split point in N, N_r is the number of sample data to the right of the split point in N, S_l is the number of positively labelled sample data in N_l, and S_r is the number of positively labelled sample data in N_r.
In some of these embodiments, the Gini index Gini(D) of the preset sample data set comprises:

obtaining the preset Gini index Gini(D) of the preset sample data set from the number of positively labelled sample data in the preset sample data set by the formula

$$Gini(D)=1-\left(\frac{S}{N}\right)^{2}-\left(\frac{N-S}{N}\right)^{2}=\frac{2S(N-S)}{N^{2}}$$

where N is the number of sample data in the preset sample data set and S is the number of positively labelled sample data in N.
In some of these embodiments, the Gini index Gini(D, A) of the sample feature comprises:

obtaining the Gini index Gini(D, A) of the sample feature by selecting, for each sample feature, a split point in the preset sample data set and applying the formula

$$Gini(D,A)=\frac{2}{N}\left(\frac{S_{l}(N_{l}-S_{l})}{N_{l}}+\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

where N is the number of sample data in the preset sample data set, N_l is the number of sample data to the left of the split point in N, N_r is the number of sample data to the right of the split point in N, S_l is the number of positively labelled sample data in N_l, and S_r is the number of positively labelled sample data in N_r.
In some embodiments, judging whether the division condition is satisfied according to the division Gini index gain and the stopping condition comprises:

judging whether the division Gini index gain is larger than the stopping condition; if so, the division condition is satisfied, and if not, the division condition is not satisfied.
In some of these embodiments, after generating the CART decision trees, the method further comprises:

generating a random forest from the CART decision trees, pre-judging through the random forest whether data will be lost in network data transmission, and retransmitting in advance the data pre-judged to be lost.
In a second aspect, an embodiment of the present application provides a data transmission system based on an improved CART decision tree. The system comprises a sample control unit, a sample statistics unit, a storage unit, a calculation unit, a comparison and judgment unit and a sample classification unit, the system having a plurality of sample statistics units;

the sample control unit is used for acquiring and controlling a preset sample data set required for training a decision tree;

the sample statistics unit is used for reading and counting the preset sample data set according to a plurality of sample features in the preset sample data set respectively;

the storage unit is used for receiving and storing the statistical results output by the sample statistics unit;

the calculation unit is used for performing, according to the statistical results, the simplified calculation with the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample features to obtain the Gini index gains of the sample features and the calculation stopping condition;

the comparison and judgment unit is used for selecting the division Gini index gain and judging whether the division condition is satisfied;

the sample classification unit is used for classifying the preset sample data set according to the judgment of the comparison and judgment unit.

In some embodiments, the sample statistics unit being used for reading and counting the preset sample data set according to a plurality of sample features in the preset sample data set respectively comprises:

each sample statistics unit reading and counting the preset sample data set according to one sample feature, the plurality of sample statistics units reading and counting the preset sample data set in parallel according to the plurality of sample features;

and storing the plurality of statistical results obtained by the read-and-count passes in the storage unit.

In some of these embodiments, the sample control unit being used for controlling the preset sample data set required for training a decision tree comprises:

the sample control unit controlling an idle sample statistics unit, when one exists, to read and count the preset sample data set according to the sample features.
In a third aspect, the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor, when executing the computer program, implements the data transmission method based on the improved CART decision tree according to the first aspect.
Compared with the related art, the data transmission method, system and equipment based on the improved CART decision tree provided by the embodiments of the application acquire a preset sample data set for training the decision tree; transform a preset threshold according to the preset sample data set to obtain a stopping condition; read and count the preset sample data set in parallel according to each sample feature in the preset sample data set and store the statistical results; perform, according to the statistical results, a simplified calculation with the Gini index of the preset sample data set and the Gini indexes of the sample features to obtain the Gini index gains of the sample features; select the division Gini index gain from the Gini index gains of the sample features and judge, from the division Gini index gain and the stopping condition, whether the division condition is satisfied; if not, stop the calculation on the preset sample data set; if so, divide the preset sample data set into a first sub-sample data set and a second sub-sample data set according to the optimal feature and the optimal split point, set each sub-sample data set as a preset sample data set, and continue the calculation until the CART decision tree is generated. This solves the problems of high delay and packet loss in TCP/IP-based network data transmission, simplifies the computational complexity of the Gini index gain, improves the training efficiency of the model, and guarantees the quality and efficiency of real-time data transmission.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flow chart of steps of a method of data transmission based on an improved CART decision tree according to an embodiment of the present application;
FIG. 2 is a block diagram of a data transmission system based on an improved CART decision tree according to an embodiment of the present application;
fig. 3 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present application.
Description of the drawings: 21. sample control unit; 22. sample statistics unit; 23. storage unit; 24. calculation unit; 25. comparison and judgment unit; 26. sample classification unit.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning understood by those of ordinary skill in the art to which this application belongs. The words "a", "an", "the" and similar words in this application do not denote a limitation of quantity and may refer to the singular or the plural. The terms "including", "comprising", "having" and any variations thereof in this application are intended to cover non-exclusive inclusion; for example, a process, method, system, article or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus. Words such as "connected" and "coupled" in this application are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. The term "plurality" herein means two or more. "And/or" describes an association of associated objects and covers three cases; for example, "A and/or B" may mean: A alone, A and B together, or B alone. The character "/" generally indicates an "or" relationship between the associated objects. The terms "first", "second", "third" and the like herein merely distinguish similar objects and do not denote a particular ordering.
The invention provides a data transmission method based on an improved CART decision tree, which aims to solve the problems of high delay and packet loss in network data transmission based on a TCP/IP protocol.
The communication network transmission field is highly sensitive to delay, and the network environment changes rapidly across different services, so the trained model must be adjusted frequently according to the service. Designing an online training model with high performance and fast environment perception is therefore an effective means of improving the quality of user experience (QoE).
The decision tree is a basic classification and regression method in machine learning and the basic building block of many tree boosting and ensemble algorithms, such as XGBoost and random forests; it is easy to understand, can build a model without requiring many training samples, and runs relatively fast.
The CART (Classification And Regression Tree) algorithm is one of the ten classic algorithms of data mining and is regarded as a milestone algorithm in the field.
The CART decision tree is an effective nonparametric classification and regression method that constructs a binary tree by growing, pruning and evaluating the tree. When the terminal node is a continuous variable, the binary tree is a regression tree; when the terminal node is a categorical variable, it is a classification tree. There are generally three decision tree implementations: the ID3 algorithm, the C4.5 algorithm and the CART algorithm. The main difference between the three lies in the attribute selection measure: ID3 uses information gain and can only process nominal data sets; C4.5 builds on ID3 with the information gain ratio and can process continuous data, but constructing the tree requires scanning and sorting the data set many times, which makes the algorithm inefficient. The CART algorithm uses the Gini index as the attribute selection measure, can also process continuous data, handles both classification and regression problems, and improves classification efficiency. Tree pruning uses statistical measures to remove the least reliable branches, which yields faster classification and improves the tree's ability to classify correctly on data independent of the training set.
The embodiment of the present application provides a data transmission method based on an improved CART decision tree, and fig. 1 is a flow chart of steps of the data transmission method based on the improved CART decision tree according to the embodiment of the present application, as shown in fig. 1, the method includes the following steps:
step S102, obtaining a preset sample data set for training a decision tree, wherein the preset sample data set comprises a plurality of sample features;
step S104, converting a preset threshold according to the preset sample data set to obtain a stopping condition;
step S106, reading and counting the preset sample data set in parallel according to each sample feature and split point in the preset sample data set, and storing the plurality of statistical results;
step S108, performing, according to the statistical results, the simplified calculation with the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample features to obtain the Gini index gains of the sample features;
step S110, selecting the division Gini index gain from the Gini index gains of the plurality of sample features;
step S112, judging whether the division condition is satisfied according to the division Gini index gain and the stopping condition;
step S114, if so, dividing the preset sample data set into a first sub-sample data set and a second sub-sample data set according to the optimal feature and the optimal split point, setting each sub-sample data set as a preset sample data set, and repeating steps S104 to S112;
step S116, if not, stopping the calculation on the preset sample data set; the CART decision tree is generated once calculation has stopped on all preset sample data sets.
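As a compact software sketch of steps S102 to S116 (assuming discrete sample features split by equality, as in the example of Table 1 below; all names are illustrative, and the stopping condition follows the threshold transformation derived further below):

```python
def gain_prime(nl, sl, nr, sr):
    # simplified Gini index gain S_l^2/N_l + S_r^2/N_r (derived below)
    return sl * sl / nl + sr * sr / nr

def best_split(data):
    """Best (feature index, value, gain') over all equality splits.
    data: list of (feature tuple, label) pairs with label in {0, 1}."""
    n, s = len(data), sum(y for _, y in data)
    best = None
    for j in range(len(data[0][0])):
        for v in {x[j] for x, _ in data}:          # candidate split points
            nl = sum(1 for x, _ in data if x[j] == v)
            sl = sum(y for x, y in data if x[j] == v)
            if nl in (0, n):
                continue                            # degenerate split
            g = gain_prime(nl, sl, n - nl, s - sl)
            if best is None or g > best[2]:
                best = (j, v, g)
    return best

def build_cart(data, threshold):
    """Steps S102-S116: divide while the division condition holds, else leaf."""
    n, s = len(data), sum(y for _, y in data)
    if s in (0, n):                                 # pure node: stop
        return {"leaf": s == n}
    stop = n / 2 * threshold + s * s / n            # step S104: transformed threshold
    split = best_split(data)                        # steps S106-S110
    if split is None or split[2] <= stop:           # steps S112/S116: stop dividing
        return {"leaf": 2 * s > n}
    j, v, _ = split                                 # step S114: divide and recurse
    left = [(x, y) for x, y in data if x[j] == v]
    right = [(x, y) for x, y in data if x[j] != v]
    return {"feature": j, "value": v,
            "left": build_cart(left, threshold),
            "right": build_cart(right, threshold)}
```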
It should be noted that there are many possible bases for deciding whether the division condition is satisfied. Commonly, the Gini index itself is used: the smaller the Gini index, the smaller the uncertainty of the sample data set and the more suitable the division. The invention instead uses the Gini index gain, i.e. the difference between the uncertainty of the sample data set before division and its uncertainty after division by a certain feature; the larger the Gini index gain, the more suitable the division.
Through steps S102 to S116 of the embodiment of the application, the problems of high delay and packet loss in TCP/IP-based network data transmission are solved, the computational complexity of the Gini index gain is reduced, the training efficiency of the model is improved, and the quality and efficiency of real-time data transmission are guaranteed.
In addition, the data transmission method based on the improved CART decision tree in this embodiment can be applied to an FPGA chip or an ASIC design to reduce resource consumption and improve the computational performance of the chip.
In some embodiments, in step S108, performing the simplified calculation with the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample features to obtain the Gini index gain comprises:

calculating, from the preset Gini(D) and the Gini(D, A) of the sample feature, the Gini index gain of the sample feature as Gain = Gini(D) - Gini(D, A), simplifying according to the formula, and further simplifying on the basis that N and S are constants before each division.

Specifically, the Gini index gain of the sample feature is Gain = Gini(D) - Gini(D, A), i.e.

$$\mathrm{Gain}=\frac{2S(N-S)}{N^{2}}-\frac{2}{N}\left(\frac{S_{l}(N_{l}-S_{l})}{N_{l}}+\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

Combining the polynomials yields:

$$\mathrm{Gain}=\frac{2}{N}\left(\frac{S(N-S)}{N}-\frac{S_{l}(N_{l}-S_{l})}{N_{l}}-\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

Further combining yields:

$$\mathrm{Gain}=\frac{2}{N}\left(S-\frac{S^{2}}{N}-S_{l}+\frac{S_{l}^{2}}{N_{l}}-S_{r}+\frac{S_{r}^{2}}{N_{r}}\right)$$

Because S = S_l + S_r, this gives:

$$\mathrm{Gain}=\frac{2}{N}\left(\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}-\frac{S^{2}}{N}\right)$$

Since both N and S are constant before each division, we obtain:

$$\mathrm{Gain}'=\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}$$

where N is the number of sample data in the preset sample data set, S is the number of positively labelled sample data in N, N_l is the number of sample data to the left of the split point in N, N_r is the number of sample data to the right of the split point in N, S_l is the number of positively labelled sample data in N_l, and S_r is the number of positively labelled sample data in N_r.
It should be noted that D is the preset sample data set and A is a sample feature. In the subsequent step S112, whether the division condition is satisfied is judged against the division Gini index gain; because the constants N and S were omitted in the formula transformation above, the corresponding stopping condition must be transformed synchronously: the preset threshold ε is multiplied by N/2 and the corresponding operation is carried out, giving the stopping condition

$$\varepsilon'=\frac{N}{2}\,\varepsilon+\frac{S^{2}}{N}$$
Through the embodiment of the application, the simplified calculation performed with the preset Gini index and the Gini index of each sample feature greatly reduces the amount of computation for the Gini index gain: after the mathematical transformation, each gain requires only 2 multiplications, 2 divisions and 1 addition, whereas before simplification merely computing the Gini index required 4 multiplications, 4 divisions and 2 subtractions, and computing the Gini index gain on top of that required still more operations. Because the gain calculation of every sample feature passes through this simplified flow, computational performance is greatly improved.
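As a sketch of this operation count under the formulas reconstructed above (helper names are illustrative):

```python
def gini_gain_prime(nl, sl, nr, sr):
    # simplified gain: 2 multiplications, 2 divisions, 1 addition
    return (sl * sl) / nl + (sr * sr) / nr

def gini_gain_full(n, s, nl, sl, nr, sr):
    # unsimplified Gain = Gini(D) - Gini(D, A), for comparison: it needs
    # several more multiplications, divisions and subtractions per split
    gini_d = 2 * s * (n - s) / (n * n)
    gini_da = (2 / n) * (sl * (nl - sl) / nl + sr * (nr - sr) / nr)
    return gini_d - gini_da
```

Both forms rank splits identically, since gini_gain_full(n, s, nl, sl, nr, sr) equals (2 / n) * (gini_gain_prime(nl, sl, nr, sr) - s * s / n) and N and S are fixed for a given node; the cheap form therefore selects the same split.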
Moreover, the calculation of the Gini index gain is divided into two parts in the time dimension: first, the transformation of the stopping condition; second, the calculation of the transformed Gini index gain. The parameters of the former are fixed before each division and need to be computed only once per preset sample space, so this computation can be carried out while the sample data set is being read for the first time. This idle period is used to time-division multiplex the multiplier and divider in the calculation unit, balancing the sample statistics time against the sample calculation time, so that data are pipelined through the system and hardware resources are effectively saved.
In some embodiments, in step S108, the Gini index Gini(D) of the preset sample data set comprises:

obtaining the preset Gini index Gini(D) of the preset sample data set from the number of positively labelled sample data in the preset sample data set by the formula

$$Gini(D)=1-\left(\frac{S}{N}\right)^{2}-\left(\frac{N-S}{N}\right)^{2}=\frac{2S(N-S)}{N^{2}}$$

where N is the number of sample data in the preset sample data set and S is the number of positively labelled sample data in N.
In some of these embodiments, in step S108, the Gini index Gini(D, A) of the sample feature comprises:

obtaining the Gini index Gini(D, A) of the sample feature by selecting, for each sample feature, a split point in the preset sample data set and applying the formula

$$Gini(D,A)=\frac{2}{N}\left(\frac{S_{l}(N_{l}-S_{l})}{N_{l}}+\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

where N is the number of sample data in the preset sample data set, N_l is the number of sample data to the left of the split point in N, N_r is the number of sample data to the right of the split point in N, S_l is the number of positively labelled sample data in N_l, and S_r is the number of positively labelled sample data in N_r.
For example, Table 1 shows a preset sample data set according to an embodiment of the present application.

TABLE 1

| Age | Has house | Has job | Credit rating | Loan approved |
|-------------|-----------|---------|---------------|---------------|
| Young | No | No | Fair | No |
| Young | No | Yes | Good | Yes |
| Young | No | No | Good | No |
| Young | Yes | Yes | Fair | Yes |
| Middle-aged | No | No | Good | No |
| Middle-aged | Yes | No | Excellent | Yes |
| Middle-aged | No | No | Fair | No |

As shown in Table 1, the preset sample data set contains 7 sample data and 4 sample features. Here, "loan approved" is the target of classification prediction, and "loan approved = yes" is set as the positive label. Consequently, the number of sample data in the preset sample data set is N = 7, and the number of positively labelled sample data is S = 3.
Furthermore, if the sample feature "has house" is selected and the split point is set at "has house = yes", then the number of sample data to the left of the split point in N is N_l = 2, the number of sample data to the right of the split point in N is N_r = 5, the number of positively labelled sample data in N_l is S_l = 2, and the number of positively labelled sample data in N_r is S_r = 1.
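Plugging these counts into the simplified gain reconstructed above gives, as a worked check:

$$\mathrm{Gain}'=\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}=\frac{2^{2}}{2}+\frac{1^{2}}{5}=2+0.2=2.2$$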
It should be noted that the sample features used in this embodiment are discrete. It should be clear to those skilled in the art that the sample features in this embodiment may also be continuous, in which case the split point may be chosen from a set of values in a certain domain.
In some embodiments, steps S106 and S108 read and count the preset sample data set in parallel according to each sample feature and split point in the preset sample data set, store the plurality of statistical results, and perform, according to the statistical results, the simplified calculation with the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample features to obtain the Gini index gains of the sample features.

Specifically, in the embodiment of the application, the preset sample data set is read and counted per sample feature, and the Gini index gain of each sample feature is computed from the statistical results with Gini(D) and Gini(D, A).

Traversal statistics over the preset sample data set is time-consuming and highly time-concentrated, while the calculation of the Gini index gain takes little time but has insufficient data throughput; simply raising the parallelism of the calculation would waste hardware resources.

Therefore, the statistical information read and counted for the N samples is stored, buffering the data between the statistics pass over the preset sample data set and the Gini index gain calculation: when the preset sample data set is read and counted for the second time, the first Gini index gain calculation starts; when the second read-and-count pass completes, the first gain calculation has completed.
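A minimal software analogue of this buffering (the one-slot FIFO depth and the thread decomposition are illustrative assumptions; in the application itself the two stages are hardware units sharing multipliers and dividers):

```python
import threading, queue

def split_counts(data):
    """One read-and-count pass: (N_l, S_l, N_r, S_r) for each candidate split."""
    n, s = len(data), sum(y for _, y in data)
    out = []
    for j in range(len(data[0][0])):
        for v in {x[j] for x, _ in data}:
            nl = sum(1 for x, _ in data if x[j] == v)
            sl = sum(y for x, y in data if x[j] == v)
            if 0 < nl < n:
                out.append((nl, sl, n - nl, s - sl))
    return out

buf = queue.Queue(maxsize=1)   # the storage unit: buffers one pass of statistics

def statistics_stage(node_datasets):
    # while pass k's gains are being computed, pass k+1 is already counting
    for d in node_datasets:
        buf.put(split_counts(d))   # blocks until the calculation stage catches up
    buf.put(None)                  # end-of-stream marker

def calculation_stage(best_gains):
    while (counts := buf.get()) is not None:
        best_gains.append(max(sl*sl/nl + sr*sr/nr for nl, sl, nr, sr in counts))

datasets = [[((0,), 1), ((1,), 0), ((0,), 1), ((1,), 1)]] * 3   # toy node sets
gains = []
worker = threading.Thread(target=statistics_stage, args=(datasets,))
worker.start()
calculation_stage(gains)
worker.join()
```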
The embodiment of the application raises the parallelism of the read-and-count stage, which occupies few resources, thereby reducing the time complexity of the statistics over the preset sample data set, and exploits the fact that the calculation stage consumes many computing resources but occupies little time, so that the modules are multiplexed effectively and reasonably and the overall performance of the system is improved.
In some embodiments, because the simplified calculation in step S108 yields the simplified Gini index gain of the sample feature, the calculation of the corresponding stopping condition in step S104 must be transformed synchronously; that is, the preset threshold is converted to obtain the stopping condition.

Step S112, judging whether the division condition is satisfied according to the division Gini index gain and the stopping condition, comprises:

judging whether the division Gini index gain is larger than the stopping condition; if so, the division condition is satisfied, and if not, the division condition is not satisfied.
In some embodiments, after the CART decision trees are generated in step S116, the method further comprises:

generating a random forest from the CART decision trees, pre-judging through the random forest whether data will be lost in network data transmission, and retransmitting in advance the data pre-judged to be lost.
It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.
An embodiment of the present application provides a data transmission system based on an improved CART decision tree. Fig. 2 is a structural block diagram of the data transmission system based on the improved CART decision tree according to the embodiment of the present application. As shown in fig. 2, the system includes a sample control unit 21, a sample statistics unit 22, a storage unit 23, a calculation unit 24, a comparison and judgment unit 25 and a sample classification unit 26, the system having a plurality of sample statistics units;

the sample control unit 21 is configured to acquire and control a preset sample data set required for training a decision tree;

the sample statistics unit 22 is configured to read and count the preset sample data set according to a plurality of sample features in the preset sample data set respectively;

the storage unit 23 is configured to receive and store the statistical results output by the sample statistics unit;

the calculation unit 24 is configured to perform, according to the statistical results, the simplified calculation with the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample features to obtain the Gini index gains of the sample features and the calculation stopping condition;

the comparison and judgment unit 25 is configured to select the division Gini index gain and judge whether the division condition is satisfied;

the sample classification unit 26 is configured to classify the preset sample data set according to the judgment of the comparison and judgment unit.

Specifically, the calculation unit 24 is further configured to complete the transformation of the calculation stopping condition according to the input information of the samples to be classified; the sample control unit 21 is further configured to control the reading of sample data according to the input sample information and the start flag, and to monitor the state of the sample statistics unit 22.

In the embodiment of the application, the sample control unit 21 acquires and controls the preset sample data set required for training the decision tree; the sample statistics unit 22 reads and counts the preset sample data set according to the plurality of sample features in the preset sample data set; the storage unit 23 receives and stores the statistical results output by the sample statistics unit; the calculation unit 24 performs, according to the statistical results, the simplified calculation with the Gini index of the preset sample data set and the Gini indexes of the sample features to obtain the Gini index gains of the sample features and the calculation stopping condition; the comparison and judgment unit 25 selects the division Gini index gain and judges whether the division condition is satisfied; and the sample classification unit 26 classifies the preset sample data set according to the judgment of the comparison and judgment unit. This solves the problems of high delay and packet loss in TCP/IP-based network data transmission, simplifies the computational complexity of the Gini index gain, improves the training efficiency of the model, and guarantees the quality and efficiency of real-time data transmission.
In some embodiments, the sample statistics unit being configured to read and count the preset sample data set according to a plurality of sample features in the preset sample data set respectively comprises:

each sample statistics unit reading and counting the preset sample data set according to one sample feature, the plurality of sample statistics units reading and counting the preset sample data set in parallel according to the plurality of sample features;

and storing the plurality of statistical results obtained by the read-and-count passes in the storage unit.

Specifically, in the embodiment of the application, the preset sample data set is read and counted by the sample statistics units, and the Gini index gain of each sample feature is computed by the calculation unit.

The traversal of the preset sample data set by the sample statistics units is time-consuming and highly time-concentrated, while the calculation unit has high computing performance and low time consumption but insufficient data throughput; simply raising the parallelism of the calculation unit would waste hardware resources.

Therefore, the sample information of the N samples read and counted by the sample statistics units is stored by the storage unit, buffering the data between the calculation unit and the sample statistics units: when the preset sample data set is read and counted for the second time, the calculation unit starts its first calculation; when the second read-and-count pass completes, the first calculation has completed.

The embodiment of the application raises the parallelism of the sample statistics units, which occupy few resources, reducing the time complexity of the statistics over the preset sample data set, and exploits the fact that the calculation unit consumes many resources but occupies little time, so that the modules are multiplexed effectively and reasonably and the overall performance of the system is improved.

In some embodiments, the sample control unit being configured to control the preset sample data set required for training a decision tree comprises:

the sample control unit controlling an idle sample statistics unit, when one exists, to read and count the preset sample data set according to the sample features.
The above modules may be functional modules or program modules and may be implemented by software or hardware. Modules implemented by hardware may be located in the same processor, or may be located in different processors in any combination.
The present embodiment also provides an electronic device comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
It should be noted that, for specific examples in this embodiment, reference may be made to examples described in the foregoing embodiments and optional implementations, and details of this embodiment are not described herein again.
In addition, in combination with the data transmission method based on the improved CART decision tree in the above embodiments, an embodiment of the present application may provide a storage medium having a computer program stored thereon; when executed by a processor, the computer program implements the data transmission method based on the improved CART decision tree of any of the above embodiments.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of data transmission based on an improved CART decision tree. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
In one embodiment, fig. 3 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present application, and as shown in fig. 3, there is provided an electronic device, which may be a server, and its internal structure diagram may be as shown in fig. 3. The electronic device comprises a processor, a network interface, an internal memory and a non-volatile memory connected by an internal bus, wherein the non-volatile memory stores an operating system, a computer program and a database. The processor is used for providing calculation and control capability, the network interface is used for communicating with an external terminal through a network connection, the internal memory is used for providing an environment for an operating system and the running of a computer program, the computer program is executed by the processor to realize a data transmission method based on the improved CART decision tree, and the database is used for storing data.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is a block diagram of only a portion of the architecture associated with the subject application, and does not constitute a limitation on the electronic devices to which the subject application may be applied, and that a particular electronic device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be understood by those skilled in the art that various features of the above-described embodiments can be combined in any combination, and for the sake of brevity, all possible combinations of features in the above-described embodiments are not described in detail, but rather, all combinations of features which are not inconsistent with each other should be construed as being within the scope of the present disclosure.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (8)

1. A data transmission method based on an improved CART decision tree, the method comprising:

acquiring a preset sample data set for training a decision tree, wherein the preset sample data set comprises a plurality of sample features;

repeatedly executing preset steps until the CART decision trees are generated; generating a random forest from the CART decision trees, pre-judging through the random forest whether data will be lost in network data transmission, and retransmitting in advance the data pre-judged to be lost;

wherein the preset steps comprise:

converting a preset threshold ε according to the number N of sample data in the preset sample data set and the number S of positively labelled sample data to obtain a stopping condition, namely multiplying the preset threshold by N/2 and carrying out the corresponding operation to obtain

$$\varepsilon'=\frac{N}{2}\,\varepsilon+\frac{S^{2}}{N}$$

reading and counting the preset sample data set in parallel according to each sample feature in the preset sample data set, and storing a plurality of statistical results, namely storing the read-and-counted data of the plurality of sample features;

calculating, one by one according to the plurality of statistical results, the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample feature to obtain the Gini index gain of the sample feature, Gain = Gini(D) - Gini(D, A), namely

$$\mathrm{Gain}=\frac{2S(N-S)}{N^{2}}-\frac{2}{N}\left(\frac{S_{l}(N_{l}-S_{l})}{N_{l}}+\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

which is simplified according to the formula to

$$\mathrm{Gain}=\frac{2}{N}\left(\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}-\frac{S^{2}}{N}\right)$$

and, since N and S are constants before each division, further simplified to the Gini index gain

$$\mathrm{Gain}'=\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}$$

wherein N is the number of sample data in the preset sample data set, S is the number of positively labelled sample data in N, N_l is the number of sample data to the left of the split point in N, N_r is the number of sample data to the right of the split point in N, S_l is the number of positively labelled sample data in N_l, and S_r is the number of positively labelled sample data in N_r;

selecting the division Gini index gain from the Gini index gains of the plurality of sample features;

judging whether the division condition is satisfied according to the division Gini index gain and the stopping condition;

if not, stopping the calculation on the preset sample data set; the CART decision tree is generated once calculation has stopped on all preset sample data sets;

if so, dividing the preset sample data set into a first sub-sample data set and a second sub-sample data set according to the optimal feature and the optimal split point, and setting the first sub-sample data set and the second sub-sample data set each as a preset sample data set.
2. The method according to claim 1, wherein the Gini index Gini(D) of the preset sample data set comprises:

obtaining the preset Gini index Gini(D) of the preset sample data set from the number of positively labelled sample data in the preset sample data set by the formula

$$Gini(D)=1-\left(\frac{S}{N}\right)^{2}-\left(\frac{N-S}{N}\right)^{2}=\frac{2S(N-S)}{N^{2}}$$

wherein N is the number of sample data in the preset sample data set and S is the number of positively labelled sample data in N.
3. The method according to claim 1, wherein the Gini index Gini(D, A) of the sample feature comprises:

obtaining the Gini index Gini(D, A) of the sample feature by selecting, for each sample feature, a split point in the preset sample data set and applying the formula

$$Gini(D,A)=\frac{2}{N}\left(\frac{S_{l}(N_{l}-S_{l})}{N_{l}}+\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

wherein N is the number of sample data in the preset sample data set, N_l is the number of sample data to the left of the split point in N, N_r is the number of sample data to the right of the split point in N, S_l is the number of positively labelled sample data in N_l, and S_r is the number of positively labelled sample data in N_r.
4. The method of claim 1, wherein judging whether the division condition is satisfied according to the division Gini index gain and the stopping condition comprises:

judging whether the division Gini index gain is larger than the stopping condition; if so, the division condition is satisfied, and if not, the division condition is not satisfied.
5. A data transmission system based on an improved CART decision tree, characterized in that the system comprises a sample control unit, a sample statistics unit, a storage unit, a calculation unit, a comparison and judgment unit and a sample classification unit, the system having a plurality of sample statistics units;

the sample control unit is used for acquiring and controlling a preset sample data set required for training a decision tree;

the sample statistics unit is used for reading and counting the preset sample data set according to a plurality of sample features in the preset sample data set respectively;

the storage unit is used for receiving and storing the statistical results output by the sample statistics unit, namely storing the read-and-counted data of the plurality of sample features;

the calculation unit is used for converting a preset threshold ε according to the number N of sample data in the preset sample data set and the number S of positively labelled sample data to obtain a stopping condition, namely multiplying the preset threshold by N/2 and carrying out the corresponding operation to obtain

$$\varepsilon'=\frac{N}{2}\,\varepsilon+\frac{S^{2}}{N}$$

the calculation unit is further configured to calculate, one by one according to the plurality of statistical results, the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample feature to obtain the Gini index gain of the sample feature, Gain = Gini(D) - Gini(D, A), namely

$$\mathrm{Gain}=\frac{2S(N-S)}{N^{2}}-\frac{2}{N}\left(\frac{S_{l}(N_{l}-S_{l})}{N_{l}}+\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

which is simplified according to the formula to

$$\mathrm{Gain}=\frac{2}{N}\left(\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}-\frac{S^{2}}{N}\right)$$

and, since N and S are constants before each division, further simplified to the Gini index gain

$$\mathrm{Gain}'=\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}$$

wherein N is the number of sample data in the preset sample data set, S is the number of positively labelled sample data in N, N_l is the number of sample data to the left of the split point in N, N_r is the number of sample data to the right of the split point in N, S_l is the number of positively labelled sample data in N_l, and S_r is the number of positively labelled sample data in N_r;

the comparison and judgment unit is used for selecting the division Gini index gain and judging whether the division condition is satisfied;

the sample classification unit is used for classifying the preset sample data set according to the judgment of the comparison and judgment unit.
6. The system according to claim 5, wherein the sample statistics unit being used for reading and counting the preset sample data set according to a plurality of sample features in the preset sample data set respectively comprises:

each sample statistics unit being used for reading and counting the preset sample data set according to one sample feature, the plurality of sample statistics units reading and counting the preset sample data set in parallel according to the plurality of sample features;

and storing the plurality of statistical results obtained by the read-and-count passes in the storage unit.

7. The system according to claim 5, wherein the sample control unit being used for controlling the preset sample data set required for training a decision tree comprises:

the sample control unit being used for controlling an idle sample statistics unit, when one exists, to read and count the preset sample data set according to the sample features.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the improved CART decision tree based data transmission method according to any of claims 1 to 4 when executing the computer program.
CN202110834148.5A 2021-07-23 2021-07-23 Data transmission method, system and equipment based on improved CART decision tree Active CN113285845B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110834148.5A CN113285845B (en) 2021-07-23 2021-07-23 Data transmission method, system and equipment based on improved CART decision tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110834148.5A CN113285845B (en) 2021-07-23 2021-07-23 Data transmission method, system and equipment based on improved CART decision tree

Publications (2)

Publication Number Publication Date
CN113285845A CN113285845A (en) 2021-08-20
CN113285845B (en) 2022-01-14

Family

ID=77287149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110834148.5A Active CN113285845B (en) 2021-07-23 2021-07-23 Data transmission method, system and equipment based on improved CART decision tree

Country Status (1)

Country Link
CN (1) CN113285845B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI525317B (en) * 2013-10-08 2016-03-11 國立清華大學 Method of Optical Defect Detection through Image analysis and Data Mining Integrated
CN108228656B (en) * 2016-12-21 2021-05-25 普天信息技术有限公司 URL classification method and device based on CART decision tree
CN110445653B (en) * 2019-08-12 2022-03-29 灵长智能科技(杭州)有限公司 Network state prediction method, device, equipment and medium
CN110516884A (en) * 2019-08-30 2019-11-29 贵州大学 A kind of short-term load forecasting method based on big data platform
CN112528277A (en) * 2020-12-07 2021-03-19 昆明理工大学 Hybrid intrusion detection method based on recurrent neural network

Also Published As

Publication number Publication date
CN113285845A (en) 2021-08-20


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant