CN113285845B - Data transmission method, system and equipment based on improved CART decision tree - Google Patents

Data transmission method, system and equipment based on improved CART decision tree

Info

Publication number
CN113285845B
CN113285845B (application CN202110834148.5A)
Authority
CN
China
Prior art keywords
sample data
sample
data set
preset
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110834148.5A
Other languages
Chinese (zh)
Other versions
CN113285845A (en)
Inventor
苑志超
朱剑飞
刘奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Primate Intelligent Technology Hangzhou Co ltd
Original Assignee
Primate Intelligent Technology Hangzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Primate Intelligent Technology Hangzhou Co ltd
Priority to CN202110834148.5A
Publication of CN113285845A
Application granted
Publication of CN113285845B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/142 Network analysis or design using statistical or mathematical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Algebra (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application relates to a data transmission method, system and equipment based on an improved CART decision tree. The method comprises: obtaining a stopping condition by transforming a preset threshold; reading and counting a preset sample data set according to each sample feature and storing the plurality of statistical results; performing a simplified calculation on the statistical results to obtain the Gini index gain of each sample feature; judging, from the Gini index gain and the stopping condition, whether the division condition is satisfied; if not, stopping the calculation on the preset sample data set; if so, dividing the preset sample data set into two sub-data sets, setting each sub-data set as a preset sample data set, and continuing the calculation until the CART decision tree is generated.

Description

Data transmission method, system and equipment based on improved CART decision tree
Technical Field
The present application relates to the field of communication network transmission, and in particular, to a data transmission method, system and device based on an improved CART decision tree.
Background
In a world where everything is interconnected, enormous data processing demands place higher requirements on transmission quality and security. In conventional Internet data transmission over the TCP/IP protocol, after TCP sends a data packet it starts a retransmission timer and places the packet in a retransmission queue. If an acknowledgement for the packet is received, the timer is cancelled; if no acknowledgement has been received when the timer expires, the packet is considered lost, taken out of the retransmission queue and sent again (starting the timer in the same way). However, the high delay, packet loss and congestion present in this timer-based retransmission scheme reduce the quality and efficiency of data transmission, and frequent packet loss caused by various network problems has become a major factor degrading the experience of clients of applications such as live audio and video streaming and games.
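For orientation, the following is a minimal sketch of the timer-based retransmission scheme described above; the class name, the fixed 200 ms timeout and the callback signature are illustrative assumptions, not details of this application:

```python
import time

RTO = 0.2  # assumed fixed retransmission timeout (seconds); real TCP adapts this

class RetransmitQueue:
    """Toy model of TCP-style retransmission: sending starts a timer, an ACK
    cancels it, and an expired timer triggers a resend."""

    def __init__(self, transmit):
        self.transmit = transmit   # callback that actually sends a packet
        self.pending = {}          # seq -> (packet, timer deadline)

    def send(self, seq, packet):
        self.transmit(seq, packet)
        self.pending[seq] = (packet, time.monotonic() + RTO)  # start the timer

    def ack(self, seq):
        self.pending.pop(seq, None)  # acknowledgement received: cancel the timer

    def poll(self):
        # packets whose timer has expired are considered lost and sent again
        now = time.monotonic()
        for seq, (packet, deadline) in list(self.pending.items()):
            if now >= deadline:
                self.send(seq, packet)  # resend and restart the timer
```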
At present, the related art offers no effective solution to the problems of high delay and packet loss in network data transmission based on the TCP/IP protocol.
Disclosure of Invention
The embodiment of the application provides a data transmission method, a system and equipment based on an improved CART decision tree, so as to at least solve the problems of high delay and packet loss in network data transmission based on a TCP/IP protocol in the related art.
In a first aspect, an embodiment of the present application provides a data transmission method based on an improved CART decision tree, where the method includes:
acquiring a preset sample data set for training a decision tree, wherein the preset sample data set comprises a plurality of sample features;
repeatedly executing preset steps, wherein the preset steps comprise:
converting a preset threshold according to the preset sample data set to obtain a stopping condition;
reading and counting the preset sample data set in parallel according to each sample feature in the preset sample data set, and storing a plurality of statistical results;
performing, according to the statistical results, a simplified calculation with the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample features to obtain the Gini index gains of the sample features;
selecting the division Gini index gain from the Gini index gains of the plurality of sample features;
judging whether the division condition is satisfied according to the division Gini index gain and the stopping condition;
if not, stopping the calculation on the preset sample data set; the CART decision tree is generated once calculation has stopped on all preset sample data sets;
if so, dividing the preset sample data set into a first sub-sample data set and a second sub-sample data set according to the optimal feature and the optimal split point, and setting the first sub-sample data set and the second sub-sample data set each as a preset sample data set.
In some embodiments, performing the simplified calculation with the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample feature to obtain the Gini index gain comprises:

calculating, from the preset Gini(D) and the Gini(D, A) of the sample feature, the Gini index gain of the sample feature as Gain = Gini(D) - Gini(D, A), namely

$$\mathrm{Gain}=\frac{2S(N-S)}{N^{2}}-\frac{2}{N}\left(\frac{S_{l}(N_{l}-S_{l})}{N_{l}}+\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

which is simplified according to the formula to

$$\mathrm{Gain}=\frac{2}{N}\left(\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}-\frac{S^{2}}{N}\right)$$

and, since N and S are constants before each division, is further simplified to the Gini index gain

$$\mathrm{Gain}'=\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}$$

where N is the number of sample data in the preset sample data set, S is the number of positively labelled sample data in N, N_l is the number of sample data to the left of the split point in N, N_r is the number of sample data to the right of the split point in N, S_l is the number of positively labelled sample data in N_l, and S_r is the number of positively labelled sample data in N_r.
In some of these embodiments, the Gini index Gini(D) of the preset sample data set comprises:

obtaining the preset Gini index Gini(D) of the preset sample data set from the number of positively labelled sample data in the preset sample data set by the formula

$$Gini(D)=1-\left(\frac{S}{N}\right)^{2}-\left(\frac{N-S}{N}\right)^{2}=\frac{2S(N-S)}{N^{2}}$$

where N is the number of sample data in the preset sample data set and S is the number of positively labelled sample data in N.
In some of these embodiments, the Gini index Gini(D, A) of the sample feature comprises:

obtaining the Gini index Gini(D, A) of the sample feature by selecting, for each sample feature, a split point in the preset sample data set and applying the formula

$$Gini(D,A)=\frac{2}{N}\left(\frac{S_{l}(N_{l}-S_{l})}{N_{l}}+\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

where N is the number of sample data in the preset sample data set, N_l is the number of sample data to the left of the split point in N, N_r is the number of sample data to the right of the split point in N, S_l is the number of positively labelled sample data in N_l, and S_r is the number of positively labelled sample data in N_r.
In some embodiments, judging whether the division condition is satisfied according to the division Gini index gain and the stopping condition comprises:

judging whether the division Gini index gain is larger than the stopping condition; if so, the division condition is satisfied, and if not, the division condition is not satisfied.
In some of these embodiments, after generating the CART decision trees, the method further comprises:

generating a random forest from the CART decision trees, pre-judging through the random forest whether data will be lost in network data transmission, and retransmitting in advance the data pre-judged to be lost.
In a second aspect, an embodiment of the present application provides a data transmission system based on an improved CART decision tree. The system comprises a sample control unit, a sample statistics unit, a storage unit, a calculation unit, a comparison and judgment unit and a sample classification unit, the system having a plurality of sample statistics units;

the sample control unit is used for acquiring and controlling a preset sample data set required for training a decision tree;

the sample statistics unit is used for reading and counting the preset sample data set according to a plurality of sample features in the preset sample data set respectively;

the storage unit is used for receiving and storing the statistical results output by the sample statistics unit;

the calculation unit is used for performing, according to the statistical results, the simplified calculation with the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample features to obtain the Gini index gains of the sample features and the calculation stopping condition;

the comparison and judgment unit is used for selecting the division Gini index gain and judging whether the division condition is satisfied;

the sample classification unit is used for classifying the preset sample data set according to the judgment of the comparison and judgment unit.

In some embodiments, the sample statistics unit being used for reading and counting the preset sample data set according to a plurality of sample features in the preset sample data set respectively comprises:

each sample statistics unit reading and counting the preset sample data set according to one sample feature, the plurality of sample statistics units reading and counting the preset sample data set in parallel according to the plurality of sample features;

and storing the plurality of statistical results obtained by the read-and-count passes in the storage unit.

In some of these embodiments, the sample control unit being used for controlling the preset sample data set required for training a decision tree comprises:

the sample control unit controlling an idle sample statistics unit, when one exists, to read and count the preset sample data set according to the sample features.
In a third aspect, the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor, when executing the computer program, implements the data transmission method based on the improved CART decision tree according to the first aspect.
Compared with the related art, the data transmission method, system and equipment based on the improved CART decision tree provided by the embodiments of the application acquire a preset sample data set for training the decision tree; transform a preset threshold according to the preset sample data set to obtain a stopping condition; read and count the preset sample data set in parallel according to each sample feature in the preset sample data set and store the statistical results; perform, according to the statistical results, a simplified calculation with the Gini index of the preset sample data set and the Gini indexes of the sample features to obtain the Gini index gains of the sample features; select the division Gini index gain from the Gini index gains of the sample features and judge, from the division Gini index gain and the stopping condition, whether the division condition is satisfied; if not, stop the calculation on the preset sample data set; if so, divide the preset sample data set into a first sub-sample data set and a second sub-sample data set according to the optimal feature and the optimal split point, set each sub-sample data set as a preset sample data set, and continue the calculation until the CART decision tree is generated. This solves the problems of high delay and packet loss in TCP/IP-based network data transmission, simplifies the computational complexity of the Gini index gain, improves the training efficiency of the model, and guarantees the quality and efficiency of real-time data transmission.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flow chart of steps of a method of data transmission based on an improved CART decision tree according to an embodiment of the present application;
FIG. 2 is a block diagram of a data transmission system based on an improved CART decision tree according to an embodiment of the present application;
fig. 3 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present application.
Description of the drawings: 21. sample control unit; 22. sample statistics unit; 23. storage unit; 24. calculation unit; 25. comparison and judgment unit; 26. sample classification unit.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning understood by those of ordinary skill in the art to which this application belongs. The words "a", "an", "the" and similar words in this application do not denote a limitation of quantity and may refer to the singular or the plural. The terms "including", "comprising", "having" and any variations thereof in this application are intended to cover non-exclusive inclusion; for example, a process, method, system, article or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus. Words such as "connected" and "coupled" in this application are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. The term "plurality" herein means two or more. "And/or" describes an association of associated objects and covers three cases; for example, "A and/or B" may mean: A alone, A and B together, or B alone. The character "/" generally indicates an "or" relationship between the associated objects. The terms "first", "second", "third" and the like herein merely distinguish similar objects and do not denote a particular ordering.
The invention provides a data transmission method based on an improved CART decision tree, which aims to solve the problems of high delay and packet loss in network data transmission based on a TCP/IP protocol.
The communication network transmission field is highly sensitive to delay, and the network environment changes rapidly across different services, so the trained model must be adjusted frequently according to the service. Designing an online training model with high performance and fast environment perception is therefore an effective means of improving the quality of user experience (QoE).
The decision tree is a basic classification and regression method in machine learning and the basic building block of many tree boosting and ensemble algorithms, such as XGBoost and random forests; it is easy to understand, can build a model without requiring many training samples, and runs relatively fast.
The CART (Classification And Regression Tree) algorithm is one of the ten classic algorithms of data mining and is regarded as a milestone algorithm in the field.
The CART decision tree is an effective nonparametric classification and regression method that constructs a binary tree by growing, pruning and evaluating the tree. When the terminal node is a continuous variable, the binary tree is a regression tree; when the terminal node is a categorical variable, it is a classification tree. There are generally three decision tree implementations: the ID3 algorithm, the C4.5 algorithm and the CART algorithm. The main difference between the three lies in the attribute selection measure: ID3 uses information gain and can only process nominal data sets; C4.5 builds on ID3 with the information gain ratio and can process continuous data, but constructing the tree requires scanning and sorting the data set many times, which makes the algorithm inefficient. The CART algorithm uses the Gini index as the attribute selection measure, can also process continuous data, handles both classification and regression problems, and improves classification efficiency. Tree pruning uses statistical measures to remove the least reliable branches, which yields faster classification and improves the tree's ability to classify correctly on data independent of the training set.
The embodiment of the present application provides a data transmission method based on an improved CART decision tree, and fig. 1 is a flow chart of steps of the data transmission method based on the improved CART decision tree according to the embodiment of the present application, as shown in fig. 1, the method includes the following steps:
step S102, obtaining a preset sample data set for training a decision tree, wherein the preset sample data set comprises a plurality of sample features;
step S104, converting a preset threshold according to the preset sample data set to obtain a stopping condition;
step S106, reading and counting the preset sample data set in parallel according to each sample feature and split point in the preset sample data set, and storing the plurality of statistical results;
step S108, performing, according to the statistical results, the simplified calculation with the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample features to obtain the Gini index gains of the sample features;
step S110, selecting the division Gini index gain from the Gini index gains of the plurality of sample features;
step S112, judging whether the division condition is satisfied according to the division Gini index gain and the stopping condition;
step S114, if so, dividing the preset sample data set into a first sub-sample data set and a second sub-sample data set according to the optimal feature and the optimal split point, setting each sub-sample data set as a preset sample data set, and repeating steps S104 to S112;
step S116, if not, stopping the calculation on the preset sample data set; the CART decision tree is generated once calculation has stopped on all preset sample data sets.
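As a compact software sketch of steps S102 to S116 (assuming discrete sample features split by equality, as in the example of Table 1 below; all names are illustrative, and the stopping condition follows the threshold transformation derived further below):

```python
def gain_prime(nl, sl, nr, sr):
    # simplified Gini index gain S_l^2/N_l + S_r^2/N_r (derived below)
    return sl * sl / nl + sr * sr / nr

def best_split(data):
    """Best (feature index, value, gain') over all equality splits.
    data: list of (feature tuple, label) pairs with label in {0, 1}."""
    n, s = len(data), sum(y for _, y in data)
    best = None
    for j in range(len(data[0][0])):
        for v in {x[j] for x, _ in data}:          # candidate split points
            nl = sum(1 for x, _ in data if x[j] == v)
            sl = sum(y for x, y in data if x[j] == v)
            if nl in (0, n):
                continue                            # degenerate split
            g = gain_prime(nl, sl, n - nl, s - sl)
            if best is None or g > best[2]:
                best = (j, v, g)
    return best

def build_cart(data, threshold):
    """Steps S102-S116: divide while the division condition holds, else leaf."""
    n, s = len(data), sum(y for _, y in data)
    if s in (0, n):                                 # pure node: stop
        return {"leaf": s == n}
    stop = n / 2 * threshold + s * s / n            # step S104: transformed threshold
    split = best_split(data)                        # steps S106-S110
    if split is None or split[2] <= stop:           # steps S112/S116: stop dividing
        return {"leaf": 2 * s > n}
    j, v, _ = split                                 # step S114: divide and recurse
    left = [(x, y) for x, y in data if x[j] == v]
    right = [(x, y) for x, y in data if x[j] != v]
    return {"feature": j, "value": v,
            "left": build_cart(left, threshold),
            "right": build_cart(right, threshold)}
```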
It should be noted that there are many possible bases for deciding whether the division condition is satisfied. Commonly, the Gini index itself is used: the smaller the Gini index, the smaller the uncertainty of the sample data set and the more suitable the division. The invention instead uses the Gini index gain, i.e. the difference between the uncertainty of the sample data set before division and its uncertainty after division by a certain feature; the larger the Gini index gain, the more suitable the division.
Through steps S102 to S116 of the embodiment of the application, the problems of high delay and packet loss in TCP/IP-based network data transmission are solved, the computational complexity of the Gini index gain is reduced, the training efficiency of the model is improved, and the quality and efficiency of real-time data transmission are guaranteed.
In addition, the data transmission method based on the improved CART decision tree in this embodiment can be applied to an FPGA chip or an ASIC design to reduce resource consumption and improve the computational performance of the chip.
In some embodiments, in step S108, performing the simplified calculation with the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample features to obtain the Gini index gain comprises:

calculating, from the preset Gini(D) and the Gini(D, A) of the sample feature, the Gini index gain of the sample feature as Gain = Gini(D) - Gini(D, A), simplifying according to the formula, and further simplifying on the basis that N and S are constants before each division.

Specifically, the Gini index gain of the sample feature is Gain = Gini(D) - Gini(D, A), i.e.

$$\mathrm{Gain}=\frac{2S(N-S)}{N^{2}}-\frac{2}{N}\left(\frac{S_{l}(N_{l}-S_{l})}{N_{l}}+\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

Combining the polynomials yields:

$$\mathrm{Gain}=\frac{2}{N}\left(\frac{S(N-S)}{N}-\frac{S_{l}(N_{l}-S_{l})}{N_{l}}-\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

Further combining yields:

$$\mathrm{Gain}=\frac{2}{N}\left(S-\frac{S^{2}}{N}-S_{l}+\frac{S_{l}^{2}}{N_{l}}-S_{r}+\frac{S_{r}^{2}}{N_{r}}\right)$$

Because S = S_l + S_r, this gives:

$$\mathrm{Gain}=\frac{2}{N}\left(\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}-\frac{S^{2}}{N}\right)$$

Since both N and S are constant before each division, we obtain:

$$\mathrm{Gain}'=\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}$$

where N is the number of sample data in the preset sample data set, S is the number of positively labelled sample data in N, N_l is the number of sample data to the left of the split point in N, N_r is the number of sample data to the right of the split point in N, S_l is the number of positively labelled sample data in N_l, and S_r is the number of positively labelled sample data in N_r.
It should be noted that D is the preset sample data set and A is a sample feature. In the subsequent step S112, whether the division condition is satisfied is judged against the division Gini index gain; because the constants N and S were omitted in the formula transformation above, the corresponding stopping condition must be transformed synchronously: the preset threshold ε is multiplied by N/2 and the corresponding operation is carried out, giving the stopping condition

$$\varepsilon'=\frac{N}{2}\,\varepsilon+\frac{S^{2}}{N}$$
Through the embodiment of the application, the simplified calculation performed with the preset Gini index and the Gini index of each sample feature greatly reduces the amount of computation for the Gini index gain: after the mathematical transformation, each gain requires only 2 multiplications, 2 divisions and 1 addition, whereas before simplification merely computing the Gini index required 4 multiplications, 4 divisions and 2 subtractions, and computing the Gini index gain on top of that required still more operations. Because the gain calculation of every sample feature passes through this simplified flow, computational performance is greatly improved.
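As a sketch of this operation count under the formulas reconstructed above (helper names are illustrative):

```python
def gini_gain_prime(nl, sl, nr, sr):
    # simplified gain: 2 multiplications, 2 divisions, 1 addition
    return (sl * sl) / nl + (sr * sr) / nr

def gini_gain_full(n, s, nl, sl, nr, sr):
    # unsimplified Gain = Gini(D) - Gini(D, A), for comparison: it needs
    # several more multiplications, divisions and subtractions per split
    gini_d = 2 * s * (n - s) / (n * n)
    gini_da = (2 / n) * (sl * (nl - sl) / nl + sr * (nr - sr) / nr)
    return gini_d - gini_da
```

Both forms rank splits identically, since gini_gain_full(n, s, nl, sl, nr, sr) equals (2 / n) * (gini_gain_prime(nl, sl, nr, sr) - s * s / n) and N and S are fixed for a given node; the cheap form therefore selects the same split.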
Moreover, the calculation of the Gini index gain is divided into two parts in the time dimension: first, the transformation of the stopping condition; second, the calculation of the transformed Gini index gain. The parameters of the former are fixed before each division and need to be computed only once per preset sample space, so this computation can be carried out while the sample data set is being read for the first time. This idle period is used to time-division multiplex the multiplier and divider in the calculation unit, balancing the sample statistics time against the sample calculation time, so that data are pipelined through the system and hardware resources are effectively saved.
In some embodiments, in step S108, the Gini index Gini(D) of the preset sample data set comprises:

obtaining the preset Gini index Gini(D) of the preset sample data set from the number of positively labelled sample data in the preset sample data set by the formula

$$Gini(D)=1-\left(\frac{S}{N}\right)^{2}-\left(\frac{N-S}{N}\right)^{2}=\frac{2S(N-S)}{N^{2}}$$

where N is the number of sample data in the preset sample data set and S is the number of positively labelled sample data in N.
In some of these embodiments, in step S108, the Gini index Gini(D, A) of the sample feature comprises:

obtaining the Gini index Gini(D, A) of the sample feature by selecting, for each sample feature, a split point in the preset sample data set and applying the formula

$$Gini(D,A)=\frac{2}{N}\left(\frac{S_{l}(N_{l}-S_{l})}{N_{l}}+\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

where N is the number of sample data in the preset sample data set, N_l is the number of sample data to the left of the split point in N, N_r is the number of sample data to the right of the split point in N, S_l is the number of positively labelled sample data in N_l, and S_r is the number of positively labelled sample data in N_r.
For example, Table 1 shows a preset sample data set according to an embodiment of the present application.

TABLE 1

| Age | Has house | Has job | Credit rating | Loan approved |
|-------------|-----------|---------|---------------|---------------|
| Young | No | No | Fair | No |
| Young | No | Yes | Good | Yes |
| Young | No | No | Good | No |
| Young | Yes | Yes | Fair | Yes |
| Middle-aged | No | No | Good | No |
| Middle-aged | Yes | No | Excellent | Yes |
| Middle-aged | No | No | Fair | No |

As shown in Table 1, the preset sample data set contains 7 sample data and 4 sample features. Here, "loan approved" is the target of classification prediction, and "loan approved = yes" is set as the positive label. Consequently, the number of sample data in the preset sample data set is N = 7, and the number of positively labelled sample data is S = 3.
Furthermore, if the sample feature "has house" is selected and the split point is set at "has house = yes", then the number of sample data to the left of the split point in N is N_l = 2, the number of sample data to the right of the split point in N is N_r = 5, the number of positively labelled sample data in N_l is S_l = 2, and the number of positively labelled sample data in N_r is S_r = 1.
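Plugging these counts into the simplified gain reconstructed above gives, as a worked check:

$$\mathrm{Gain}'=\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}=\frac{2^{2}}{2}+\frac{1^{2}}{5}=2+0.2=2.2$$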
It should be noted that the sample features used in this embodiment are discrete. It should be clear to those skilled in the art that the sample features in this embodiment may also be continuous, in which case the split point may be chosen from a set of values in a certain domain.
In some embodiments, steps S106 and S108 read and count the preset sample data set in parallel according to each sample feature and split point in the preset sample data set, store the plurality of statistical results, and perform, according to the statistical results, the simplified calculation with the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample features to obtain the Gini index gains of the sample features.

Specifically, in the embodiment of the application, the preset sample data set is read and counted per sample feature, and the Gini index gain of each sample feature is computed from the statistical results with Gini(D) and Gini(D, A).

Traversal statistics over the preset sample data set is time-consuming and highly time-concentrated, while the calculation of the Gini index gain takes little time but has insufficient data throughput; simply raising the parallelism of the calculation would waste hardware resources.

Therefore, the statistical information read and counted for the N samples is stored, buffering the data between the statistics pass over the preset sample data set and the Gini index gain calculation: when the preset sample data set is read and counted for the second time, the first Gini index gain calculation starts; when the second read-and-count pass completes, the first gain calculation has completed.
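A minimal software analogue of this buffering (the one-slot FIFO depth and the thread decomposition are illustrative assumptions; in the application itself the two stages are hardware units sharing multipliers and dividers):

```python
import threading, queue

def split_counts(data):
    """One read-and-count pass: (N_l, S_l, N_r, S_r) for each candidate split."""
    n, s = len(data), sum(y for _, y in data)
    out = []
    for j in range(len(data[0][0])):
        for v in {x[j] for x, _ in data}:
            nl = sum(1 for x, _ in data if x[j] == v)
            sl = sum(y for x, y in data if x[j] == v)
            if 0 < nl < n:
                out.append((nl, sl, n - nl, s - sl))
    return out

buf = queue.Queue(maxsize=1)   # the storage unit: buffers one pass of statistics

def statistics_stage(node_datasets):
    # while pass k's gains are being computed, pass k+1 is already counting
    for d in node_datasets:
        buf.put(split_counts(d))   # blocks until the calculation stage catches up
    buf.put(None)                  # end-of-stream marker

def calculation_stage(best_gains):
    while (counts := buf.get()) is not None:
        best_gains.append(max(sl*sl/nl + sr*sr/nr for nl, sl, nr, sr in counts))

datasets = [[((0,), 1), ((1,), 0), ((0,), 1), ((1,), 1)]] * 3   # toy node sets
gains = []
worker = threading.Thread(target=statistics_stage, args=(datasets,))
worker.start()
calculation_stage(gains)
worker.join()
```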
The embodiment of the application raises the parallelism of the read-and-count stage, which occupies few resources, thereby reducing the time complexity of the statistics over the preset sample data set, and exploits the fact that the calculation stage consumes many computing resources but occupies little time, so that the modules are multiplexed effectively and reasonably and the overall performance of the system is improved.
In some embodiments, because the simplified calculation in step S108 yields the simplified Gini index gain of the sample feature, the calculation of the corresponding stopping condition in step S104 must be transformed synchronously; that is, the preset threshold is converted to obtain the stopping condition.

Step S112, judging whether the division condition is satisfied according to the division Gini index gain and the stopping condition, comprises:

judging whether the division Gini index gain is larger than the stopping condition; if so, the division condition is satisfied, and if not, the division condition is not satisfied.
In some embodiments, after the CART decision trees are generated in step S116, the method further comprises:

generating a random forest from the CART decision trees, pre-judging through the random forest whether data will be lost in network data transmission, and retransmitting in advance the data pre-judged to be lost.
It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.
An embodiment of the present application provides a data transmission system based on an improved CART decision tree. Fig. 2 is a structural block diagram of the data transmission system based on the improved CART decision tree according to the embodiment of the present application. As shown in fig. 2, the system includes a sample control unit 21, a sample statistics unit 22, a storage unit 23, a calculation unit 24, a comparison and judgment unit 25 and a sample classification unit 26, the system having a plurality of sample statistics units;

the sample control unit 21 is configured to acquire and control a preset sample data set required for training a decision tree;

the sample statistics unit 22 is configured to read and count the preset sample data set according to a plurality of sample features in the preset sample data set respectively;

the storage unit 23 is configured to receive and store the statistical results output by the sample statistics unit;

the calculation unit 24 is configured to perform, according to the statistical results, the simplified calculation with the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample features to obtain the Gini index gains of the sample features and the calculation stopping condition;

the comparison and judgment unit 25 is configured to select the division Gini index gain and judge whether the division condition is satisfied;

the sample classification unit 26 is configured to classify the preset sample data set according to the judgment of the comparison and judgment unit.

Specifically, the calculation unit 24 is further configured to complete the transformation of the calculation stopping condition according to the input information of the samples to be classified; the sample control unit 21 is further configured to control the reading of sample data according to the input sample information and the start flag, and to monitor the state of the sample statistics unit 22.

In the embodiment of the application, the sample control unit 21 acquires and controls the preset sample data set required for training the decision tree; the sample statistics unit 22 reads and counts the preset sample data set according to the plurality of sample features in the preset sample data set; the storage unit 23 receives and stores the statistical results output by the sample statistics unit; the calculation unit 24 performs, according to the statistical results, the simplified calculation with the Gini index of the preset sample data set and the Gini indexes of the sample features to obtain the Gini index gains of the sample features and the calculation stopping condition; the comparison and judgment unit 25 selects the division Gini index gain and judges whether the division condition is satisfied; and the sample classification unit 26 classifies the preset sample data set according to the judgment of the comparison and judgment unit. This solves the problems of high delay and packet loss in TCP/IP-based network data transmission, simplifies the computational complexity of the Gini index gain, improves the training efficiency of the model, and guarantees the quality and efficiency of real-time data transmission.
In some embodiments, the sample statistics unit being configured to read and count the preset sample data set according to a plurality of sample features in the preset sample data set respectively comprises:

each sample statistics unit reading and counting the preset sample data set according to one sample feature, the plurality of sample statistics units reading and counting the preset sample data set in parallel according to the plurality of sample features;

and storing the plurality of statistical results obtained by the read-and-count passes in the storage unit.

Specifically, in the embodiment of the application, the preset sample data set is read and counted by the sample statistics units, and the Gini index gain of each sample feature is computed by the calculation unit.

The traversal of the preset sample data set by the sample statistics units is time-consuming and highly time-concentrated, while the calculation unit has high computing performance and low time consumption but insufficient data throughput; simply raising the parallelism of the calculation unit would waste hardware resources.

Therefore, the sample information of the N samples read and counted by the sample statistics units is stored by the storage unit, buffering the data between the calculation unit and the sample statistics units: when the preset sample data set is read and counted for the second time, the calculation unit starts its first calculation; when the second read-and-count pass completes, the first calculation has completed.

The embodiment of the application raises the parallelism of the sample statistics units, which occupy few resources, reducing the time complexity of the statistics over the preset sample data set, and exploits the fact that the calculation unit consumes many resources but occupies little time, so that the modules are multiplexed effectively and reasonably and the overall performance of the system is improved.

In some embodiments, the sample control unit being configured to control the preset sample data set required for training a decision tree comprises:

the sample control unit controlling an idle sample statistics unit, when one exists, to read and count the preset sample data set according to the sample features.
The above modules may be functional modules or program modules and may be implemented by software or hardware. Modules implemented by hardware may be located in the same processor, or may be located in different processors in any combination.
The present embodiment also provides an electronic device comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
It should be noted that, for specific examples in this embodiment, reference may be made to examples described in the foregoing embodiments and optional implementations, and details of this embodiment are not described herein again.
In addition, in combination with the data transmission method based on the improved CART decision tree in the above embodiments, an embodiment of the present application may provide a storage medium having a computer program stored thereon; when executed by a processor, the computer program implements the data transmission method based on the improved CART decision tree of any of the above embodiments.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of data transmission based on an improved CART decision tree. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
In one embodiment, fig. 3 is a schematic diagram of an internal structure of an electronic device according to an embodiment of the present application, and as shown in fig. 3, there is provided an electronic device, which may be a server, and its internal structure diagram may be as shown in fig. 3. The electronic device comprises a processor, a network interface, an internal memory and a non-volatile memory connected by an internal bus, wherein the non-volatile memory stores an operating system, a computer program and a database. The processor is used for providing calculation and control capability, the network interface is used for communicating with an external terminal through a network connection, the internal memory is used for providing an environment for an operating system and the running of a computer program, the computer program is executed by the processor to realize a data transmission method based on the improved CART decision tree, and the database is used for storing data.
Those skilled in the art will appreciate that the architecture shown in fig. 3 is a block diagram of only a portion of the architecture associated with the subject application, and does not constitute a limitation on the electronic devices to which the subject application may be applied, and that a particular electronic device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be understood by those skilled in the art that various features of the above-described embodiments can be combined in any combination, and for the sake of brevity, all possible combinations of features in the above-described embodiments are not described in detail, but rather, all combinations of features which are not inconsistent with each other should be construed as being within the scope of the present disclosure.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (8)

1. A data transmission method based on an improved CART decision tree, the method comprising:

acquiring a preset sample data set for training a decision tree, wherein the preset sample data set comprises a plurality of sample features;

repeatedly executing preset steps until the CART decision trees are generated; generating a random forest from the CART decision trees, pre-judging through the random forest whether data will be lost in network data transmission, and retransmitting in advance the data pre-judged to be lost;

wherein the preset steps comprise:

converting a preset threshold ε according to the number N of sample data in the preset sample data set and the number S of positively labelled sample data to obtain a stopping condition, namely multiplying the preset threshold by N/2 and carrying out the corresponding operation to obtain

$$\varepsilon'=\frac{N}{2}\,\varepsilon+\frac{S^{2}}{N}$$

reading and counting the preset sample data set in parallel according to each sample feature in the preset sample data set, and storing a plurality of statistical results, namely storing the read-and-counted data of the plurality of sample features;

calculating, one by one according to the plurality of statistical results, the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample feature to obtain the Gini index gain of the sample feature, Gain = Gini(D) - Gini(D, A), namely

$$\mathrm{Gain}=\frac{2S(N-S)}{N^{2}}-\frac{2}{N}\left(\frac{S_{l}(N_{l}-S_{l})}{N_{l}}+\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

which is simplified according to the formula to

$$\mathrm{Gain}=\frac{2}{N}\left(\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}-\frac{S^{2}}{N}\right)$$

and, since N and S are constants before each division, further simplified to the Gini index gain

$$\mathrm{Gain}'=\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}$$

wherein N is the number of sample data in the preset sample data set, S is the number of positively labelled sample data in N, N_l is the number of sample data to the left of the split point in N, N_r is the number of sample data to the right of the split point in N, S_l is the number of positively labelled sample data in N_l, and S_r is the number of positively labelled sample data in N_r;

selecting the division Gini index gain from the Gini index gains of the plurality of sample features;

judging whether the division condition is satisfied according to the division Gini index gain and the stopping condition;

if not, stopping the calculation on the preset sample data set; the CART decision tree is generated once calculation has stopped on all preset sample data sets;

if so, dividing the preset sample data set into a first sub-sample data set and a second sub-sample data set according to the optimal feature and the optimal split point, and setting the first sub-sample data set and the second sub-sample data set each as a preset sample data set.
2. The method according to claim 1, wherein the Gini index Gini(D) of the preset sample data set comprises:

obtaining the preset Gini index Gini(D) of the preset sample data set from the number of positively labelled sample data in the preset sample data set by the formula

$$Gini(D)=1-\left(\frac{S}{N}\right)^{2}-\left(\frac{N-S}{N}\right)^{2}=\frac{2S(N-S)}{N^{2}}$$

wherein N is the number of sample data in the preset sample data set and S is the number of positively labelled sample data in N.
3. The method according to claim 1, wherein the Gini index Gini(D, A) of the sample feature comprises:

obtaining the Gini index Gini(D, A) of the sample feature by selecting, for each sample feature, a split point in the preset sample data set and applying the formula

$$Gini(D,A)=\frac{2}{N}\left(\frac{S_{l}(N_{l}-S_{l})}{N_{l}}+\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

wherein N is the number of sample data in the preset sample data set, N_l is the number of sample data to the left of the split point in N, N_r is the number of sample data to the right of the split point in N, S_l is the number of positively labelled sample data in N_l, and S_r is the number of positively labelled sample data in N_r.
4. The method of claim 1, wherein judging whether the division condition is satisfied according to the division Gini index gain and the stopping condition comprises:

judging whether the division Gini index gain is larger than the stopping condition; if so, the division condition is satisfied, and if not, the division condition is not satisfied.
5. A data transmission system based on an improved CART decision tree, characterized in that the system comprises a sample control unit, a sample statistics unit, a storage unit, a calculation unit, a comparison and judgment unit and a sample classification unit, the system having a plurality of sample statistics units;

the sample control unit is used for acquiring and controlling a preset sample data set required for training a decision tree;

the sample statistics unit is used for reading and counting the preset sample data set according to a plurality of sample features in the preset sample data set respectively;

the storage unit is used for receiving and storing the statistical results output by the sample statistics unit, namely storing the read-and-counted data of the plurality of sample features;

the calculation unit is used for converting a preset threshold ε according to the number N of sample data in the preset sample data set and the number S of positively labelled sample data to obtain a stopping condition, namely multiplying the preset threshold by N/2 and carrying out the corresponding operation to obtain

$$\varepsilon'=\frac{N}{2}\,\varepsilon+\frac{S^{2}}{N}$$

the calculation unit is further configured to calculate, one by one according to the plurality of statistical results, the Gini index Gini(D) of the preset sample data set and the Gini index Gini(D, A) of the sample feature to obtain the Gini index gain of the sample feature, Gain = Gini(D) - Gini(D, A), namely

$$\mathrm{Gain}=\frac{2S(N-S)}{N^{2}}-\frac{2}{N}\left(\frac{S_{l}(N_{l}-S_{l})}{N_{l}}+\frac{S_{r}(N_{r}-S_{r})}{N_{r}}\right)$$

which is simplified according to the formula to

$$\mathrm{Gain}=\frac{2}{N}\left(\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}-\frac{S^{2}}{N}\right)$$

and, since N and S are constants before each division, further simplified to the Gini index gain

$$\mathrm{Gain}'=\frac{S_{l}^{2}}{N_{l}}+\frac{S_{r}^{2}}{N_{r}}$$

wherein N is the number of sample data in the preset sample data set, S is the number of positively labelled sample data in N, N_l is the number of sample data to the left of the split point in N, N_r is the number of sample data to the right of the split point in N, S_l is the number of positively labelled sample data in N_l, and S_r is the number of positively labelled sample data in N_r;

the comparison and judgment unit is used for selecting the division Gini index gain and judging whether the division condition is satisfied;

the sample classification unit is used for classifying the preset sample data set according to the judgment of the comparison and judgment unit.
6. The system according to claim 5, wherein the sample statistics unit being used for reading and counting the preset sample data set according to a plurality of sample features in the preset sample data set respectively comprises:

each sample statistics unit being used for reading and counting the preset sample data set according to one sample feature, the plurality of sample statistics units reading and counting the preset sample data set in parallel according to the plurality of sample features;

and storing the plurality of statistical results obtained by the read-and-count passes in the storage unit.

7. The system according to claim 5, wherein the sample control unit being used for controlling the preset sample data set required for training a decision tree comprises:

the sample control unit being used for controlling an idle sample statistics unit, when one exists, to read and count the preset sample data set according to the sample features.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the improved CART decision tree based data transmission method according to any of claims 1 to 4 when executing the computer program.
CN202110834148.5A 2021-07-23 2021-07-23 Data transmission method, system and equipment based on improved CART decision tree Active CN113285845B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110834148.5A CN113285845B (en) 2021-07-23 2021-07-23 Data transmission method, system and equipment based on improved CART decision tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110834148.5A CN113285845B (en) 2021-07-23 2021-07-23 Data transmission method, system and equipment based on improved CART decision tree

Publications (2)

Publication Number Publication Date
CN113285845A CN113285845A (en) 2021-08-20
CN113285845B (en) 2022-01-14

Family

ID=77287149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110834148.5A Active CN113285845B (en) 2021-07-23 2021-07-23 Data transmission method, system and equipment based on improved CART decision tree

Country Status (1)

Country Link
CN (1) CN113285845B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI525317B (en) * 2013-10-08 2016-03-11 國立清華大學 Method of Optical Defect Detection through Image analysis and Data Mining Integrated
CN108228656B (en) * 2016-12-21 2021-05-25 普天信息技术有限公司 URL classification method and device based on CART decision tree
CN110445653B (en) * 2019-08-12 2022-03-29 灵长智能科技(杭州)有限公司 Network state prediction method, device, equipment and medium
CN110516884A (en) * 2019-08-30 2019-11-29 贵州大学 A kind of short-term load forecasting method based on big data platform
CN112528277A (en) * 2020-12-07 2021-03-19 昆明理工大学 Hybrid intrusion detection method based on recurrent neural network

Also Published As

Publication number Publication date
CN113285845A (en) 2021-08-20


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant