CN114553591B - Training method of random forest model, abnormal flow detection method and device


Info

Publication number
CN114553591B
Authority
CN
China
Prior art keywords
random forest
forest model
training
sample
samples
Prior art date
Legal status
Active
Application number
CN202210279285.1A
Other languages
Chinese (zh)
Other versions
CN114553591A (en)
Inventor
白兴伟
王闰婷
Current Assignee
Beijing Huayuan Information Technology Co Ltd
Original Assignee
Beijing Huayuan Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Huayuan Information Technology Co Ltd
Priority to CN202210279285.1A
Publication of CN114553591A
Application granted
Publication of CN114553591B
Legal status: Active (Current)
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00: Network architectures or network communication protocols for network security
    • H04L 63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L 63/1425: Traffic logging, e.g. anomaly detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Medical Informatics (AREA)
  • Signal Processing (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the disclosure provide a training method for a random forest model, an abnormal flow detection method, and corresponding devices. The method includes: collecting original flow data; performing feature labeling on the original flow data samples to generate training samples; randomly dividing the training samples into a training set and a test set; preprocessing the training set samples according to the features, and training a random forest model on the preprocessed training set samples; and inputting the test set samples into the trained random forest model, comparing the features output by the random forest model with the marked features, and judging from the comparison result whether the trained random forest model meets the security requirement. In this way, the speed and accuracy with which the random forest model recognizes flow features can be improved, meeting the requirements for analyzing and detecting abnormal flow features.

Description

Training method of random forest model, abnormal flow detection method and device
Technical Field
The disclosure relates to the technical field of network security, in particular to a training method of a random forest model, an abnormal flow detection method and an abnormal flow detection device.
Background
At present, networks are filled with a great deal of malicious attack traffic, and the risk they face is steadily increasing. Against this background, abnormal traffic in the network is detected and security analysis is performed on the data transmitted in the network, so that malicious attacks such as theft and tampering of user and application data are avoided, and an alarm can be raised and the incident handled in time, before heavy losses are caused.
The random forest algorithm is robust to noisy data, which allows it to effectively mitigate overfitting; it offers high accuracy and stability and is widely applied in the security field. However, because abnormal flow is varied and changes quickly, the detection results of current random forest algorithms on abnormal flow are not accurate enough.
Disclosure of Invention
The disclosure provides a training method of a random forest model, an abnormal flow detection method and a device.
According to a first aspect of the present disclosure, there is provided a training method of a random forest model, including:
collecting original flow data;
performing feature labeling on the original flow data sample to generate a training sample;
randomly dividing the training samples into a training set and a testing set;
preprocessing a training set sample according to the characteristics, and training a random forest model according to the preprocessed training set sample;
and inputting the test set sample into the trained random forest model, comparing the characteristics output by the random forest model with the marking characteristics, and judging whether the trained random forest model meets the safety requirement according to the comparison result.
In some implementations of the first aspect, the characterizing the original traffic data samples includes:
the raw flow data samples are classified according to characteristics,
marking the original flow data sample as normal and/or abnormal according to the classification result;
processing the default values: when the number of default values is small and the influence degree is small, discarding the corresponding default values, and when the number of default values is large or the influence degree is large, assigning the average value of the features to the corresponding default values;
processing the abnormal value: processing is performed according to the difference between the abnormal value and the standard value.
In some implementations of the first aspect, the preprocessing the training set samples according to the features includes:
and marking the characteristic values of the training set samples according to the characteristics, sorting the training set samples according to the characteristic values, and selecting a plurality of previous samples as a new sample set.
In some implementations of the first aspect, the marking feature values for the training set samples according to features includes:
and (3) carrying out decentralization processing on the training set samples, calculating the correlation among the features by using a covariance matrix, and marking the feature values according to the correlation.
In some implementations of the first aspect, inputting the test set sample into the trained random forest model, comparing the feature output by the random forest model with the signature feature, and determining whether the trained random forest model meets the security requirement according to the comparison result includes:
setting a threshold according to the safety requirement, and sequentially inputting test set samples into a random forest model;
if the consistency ratio of the features output by the random forest model and the marking features is smaller than a threshold value, the safety requirement is not met; and if the consistency ratio of the features output by the random forest model and the marking features is greater than a threshold value, meeting the safety requirement.
In some implementations of the first aspect, further comprising a test set sample supplemental training set comprising:
inputting the test set sample into the trained random forest model, marking the corresponding flow data sample characteristics when the random forest model cannot output the characteristics, and storing the corresponding flow data sample and the characteristics into the training set.
According to a second aspect of the present disclosure, there is provided an abnormal flow detection method based on a random forest model, including:
and inputting the flow data into the random forest model obtained by training by the training method of the random forest model, and judging whether the flow is abnormal or not according to the characteristics output by the random forest model.
In some implementations of the second aspect, further comprising updating the training set samples, including:
when the random forest model cannot output the characteristics, the corresponding flow data characteristics are marked, and the corresponding flow data and the characteristics are stored in the training set.
According to a third aspect of the present disclosure, there is provided a training apparatus of a random forest model, comprising:
the acquisition unit is used for acquiring original flow data;
the marking unit is used for carrying out characteristic marking on the original flow data sample to generate a training sample;
the grouping unit is used for randomly dividing the training samples into a training set and a testing set;
the training unit is used for preprocessing the training set samples according to the characteristics and training the random forest model according to the preprocessed training set samples;
the test unit is used for inputting the test set sample into the trained random forest model, comparing the characteristics output by the random forest model with the marking characteristics, and judging whether the trained random forest model meets the safety requirement according to the comparison result.
According to a fourth aspect of the present disclosure, there is provided an abnormal flow detection apparatus based on a random forest model, including:
the model generation unit is used for training by adopting the training method of the random forest model to obtain the random forest model;
and the judging unit is used for inputting the data flow into the trained random forest model and judging whether the data flow is abnormal flow or not according to the characteristics output by the random forest model.
According to the method and device of the disclosure, the original flow data features are matched to the random forest model and the random forest model is trained on them, so that the model learns from the original flow data features. This improves the speed and accuracy with which the random forest model recognizes flow features and meets the requirements for analyzing and detecting abnormal flow features.
It should be understood that what is described in this summary is not intended to limit the critical or essential features of the embodiments of the disclosure nor to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. For a better understanding of the present disclosure, and without limiting the disclosure thereto, the same or similar reference numerals denote the same or similar elements, wherein:
FIG. 1 illustrates a flow chart of a training method of a random forest model according to an embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a random forest model-based abnormal flow detection method according to an embodiment of the present disclosure;
FIG. 3 illustrates a block diagram of a training apparatus of a random forest model, according to an embodiment of the present disclosure;
FIG. 4 illustrates a block diagram of an anomaly flow detection device based on a random forest model, according to an embodiment of the present disclosure;
fig. 5 illustrates a block diagram of an exemplary electronic device capable of implementing embodiments of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments in this disclosure without inventive faculty, are intended to be within the scope of this disclosure.
In addition, the term "and/or" herein merely describes an association between objects and indicates that three relationships may exist; for example, "A and/or B" may mean that A exists alone, that A and B exist together, or that B exists alone. The character "/" herein generally indicates that the objects before and after it are in an "or" relationship.
In the method of the disclosure, the original flow data features are matched to the random forest model and the random forest model is trained, so that the model can learn from the original flow data features. This improves the speed and accuracy with which the random forest model recognizes flow features and meets the requirements for analyzing and detecting abnormal flow features.
Fig. 1 shows a flowchart of a training method 100 of a random forest model according to an embodiment of the present disclosure.
As shown in fig. 1, the training method 100 of the random forest model includes:
s101, collecting original flow data;
s102, performing feature labeling on the original flow data sample to generate a training sample;
s103, randomly dividing the training sample into a training set and a testing set;
s104, preprocessing a training set sample according to the characteristics, and training a random forest model according to the preprocessed training set sample;
s105, inputting the test set sample into the trained random forest model, comparing the characteristics output by the random forest model with the marking characteristics, and judging whether the trained random forest model meets the safety requirement according to the comparison result.
A random forest is a classifier that uses multiple decision trees to train on and predict samples. In this disclosure, a random forest model is mainly used to train on, and predict, the original flow data and its marked features.
In step S101, the original flow data may be collected with a commonly used open-source packet capture tool such as Wireshark or hping. In some embodiments, traffic packets are captured with Wireshark and stored in pcap format to obtain the original flow data samples. A pcap file contains a file header, per-packet headers and packet data, whereas the original flow data consists only of the packet data, so the pcap file must be converted to separate the file header and packet headers from the original flow data.
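As a non-limiting illustration, this pcap-to-sample conversion might be sketched in Python with the scapy module; the file name traffic.pcap and the selected fields are illustrative assumptions rather than values prescribed by this disclosure.

    # Illustrative sketch: parse a Wireshark capture and keep only the packet data.
    # "traffic.pcap" and the chosen fields are assumptions for illustration.
    from scapy.all import rdpcap, IP

    packets = rdpcap("traffic.pcap")   # scapy handles the pcap file header and per-packet headers

    samples = []
    for pkt in packets:
        if IP not in pkt:
            continue
        samples.append({
            "src_ip": pkt[IP].src,
            "dst_ip": pkt[IP].dst,
            "proto": pkt[IP].proto,
            "length": len(pkt),
            "payload": bytes(pkt[IP].payload),   # the original flow data without headers
        })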
In step S102, the characterizing the original flow data sample includes:
the raw flow data samples are classified according to characteristics,
marking the original flow data sample as normal and/or abnormal according to the classification result;
processing the default values: when the number of default values is small and the influence degree is small, discarding the corresponding default values, and when the number of default values is large or the influence degree is large, assigning the average value of the features to the corresponding default values;
processing the abnormal value: processing is performed according to the difference between the abnormal value and the standard value.
In some implementations, classifying the original flow data samples according to features includes: filtering the pcap file by protocol type, extracting the encrypted network traffic data, and classifying the traffic features by source/destination IP, source/destination port and protocol.
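As a non-limiting sketch of this classification step (building on the packet list from the previous sketch; names are illustrative), packets can be grouped into flows by the source/destination IP, source/destination port and protocol:

    # Illustrative sketch: group captured packets into flows by the 5-tuple.
    from collections import defaultdict
    from scapy.all import IP, TCP, UDP

    flows = defaultdict(list)
    for pkt in packets:
        if IP not in pkt:
            continue
        if TCP in pkt:
            sport, dport, proto = pkt[TCP].sport, pkt[TCP].dport, "tcp"
        elif UDP in pkt:
            sport, dport, proto = pkt[UDP].sport, pkt[UDP].dport, "udp"
        else:
            continue   # protocol-type filtering: keep only TCP/UDP traffic here
        flows[(pkt[IP].src, pkt[IP].dst, sport, dport, proto)].append(pkt)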
In some embodiments, when the original flow data samples are marked as normal and/or abnormal according to the classification result, only the normal samples may be marked, with the remaining samples defaulting to abnormal; only the abnormal samples may be marked, with the remaining samples defaulting to normal; or both the normal and the abnormal samples may be marked.
In some embodiments, the method further includes mapping the original flow data samples: the scapy module in Python is used to process the collected original flow data samples, corresponding feature items are selected for different applications, and character-type sample features that are complex and inconvenient to handle are mapped into a numeric sequence in the form of key-value pairs.
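A minimal sketch of such a key-value mapping is given below; the feature names and numeric codes are illustrative assumptions, not values prescribed by this disclosure.

    # Illustrative sketch: map character-type features to a numeric sequence via key-value pairs.
    PROTO_MAP = {"tcp": 0, "udp": 1, "icmp": 2}             # assumed protocol codes
    FLAG_MAP = {"S": 0, "SA": 1, "A": 2, "F": 3, "R": 4}    # assumed TCP-flag codes

    def encode_sample(sample):
        """Turn one traffic sample, given as a dict of raw feature values, into a numeric vector."""
        return [
            PROTO_MAP.get(sample["proto"], -1),             # unknown protocols map to -1
            FLAG_MAP.get(sample.get("flags", ""), -1),
            sample["length"],
        ]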
For abnormal value processing, data visualization can be used: abnormal values are identified with a scatter plot or box plot, the maximum and minimum values of the data set are recorded, and the abnormal values are handled based on empirical judgment and the difference between the corresponding value and the standard value.
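A non-limiting sketch of the default-value and abnormal-value handling described above is shown below, using pandas; the 5% discard threshold and the 1.5 x IQR box-plot rule are illustrative assumptions.

    # Illustrative sketch: handle default (missing) values and flag abnormal values.
    import pandas as pd

    def preprocess(df: pd.DataFrame, drop_threshold: float = 0.05) -> pd.DataFrame:
        df = df.copy()
        for col in df.select_dtypes("number").columns:
            missing_ratio = df[col].isna().mean()
            if 0 < missing_ratio < drop_threshold:
                df = df[df[col].notna()].copy()            # few, low-impact defaults: discard them
            elif missing_ratio >= drop_threshold:
                df[col] = df[col].fillna(df[col].mean())   # many defaults: assign the feature mean
            # box-plot (IQR) rule: flag values far from the standard range as abnormal
            q1, q3 = df[col].quantile([0.25, 0.75])
            iqr = q3 - q1
            df[col + "_abnormal"] = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        return df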
According to the embodiments of the disclosure, feature marking is performed on the original flow data samples, so when the random forest model faces new flow data it can identify it quickly, having already learned from the original flow samples and their features. Handling default values and abnormal values also allows many different types of flow data to be processed.
In step S103, the training samples are randomly divided into a training set and a test set.
According to the embodiments of the disclosure, random sampling is used to ensure that data are drawn randomly from the training samples, simulating the uncertainty and imbalance of traffic data distribution in a real network environment. The randomly sampled training sample data is then divided into a training set and a test set, which facilitates the subsequent training and evaluation of the random forest model.
In step S104, the preprocessing the training set sample according to the features includes:
and marking the characteristic values of the training set samples according to the characteristics, sorting the training set samples according to the characteristic values, and selecting a plurality of previous samples as a new sample set.
It will be appreciated that the higher a sample's feature value, the more likely it is to be ranked into the new sample set, so the feature value marking has a very important impact on the new sample set. The more samples are retained in the new sample set, the better the training effect; the fewer samples are retained, the higher the training efficiency.
According to the embodiments of the disclosure, re-determining the new sample set according to the feature values reduces the number of training samples, improving the training efficiency of the random forest model while preserving the training effect as far as possible.
In some implementations, the marking feature values for the training set samples according to features includes:
and (3) carrying out decentralization processing on the training set samples, calculating the correlation among the features by using a covariance matrix, and marking the feature values according to the correlation.
Specifically, feature reduction and feature extraction on the sample set can be realized with PCA. First, the training set samples are centered, as shown in formula (1), where x_i denotes a sample in the sample set and n is the total number of samples. The correlation between the features is then calculated through the covariance matrix, as shown in formula (2). Eigenvalue decomposition is then performed on the covariance matrix, the computed eigenvalues are sorted, the first k eigenvalues are selected and the corresponding eigenvectors are computed to complete the feature extraction, and a new, dimension-reduced sample set is finally obtained.
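A non-limiting numpy sketch of this PCA step is given below; the choice of k is an illustrative parameter, and the centering and covariance computations correspond to formulas (1) and (2) respectively.

    # Illustrative sketch: PCA-based feature extraction and dimension reduction.
    import numpy as np

    def pca_reduce(X: np.ndarray, k: int):
        X_centered = X - X.mean(axis=0)             # de-centering, cf. formula (1)
        cov = np.cov(X_centered, rowvar=False)      # feature-feature covariance, cf. formula (2)
        eigvals, eigvecs = np.linalg.eigh(cov)      # eigendecomposition of the symmetric matrix
        order = np.argsort(eigvals)[::-1]           # sort eigenvalues in descending order
        top_k = eigvecs[:, order[:k]]               # keep the first k eigenvectors
        return X_centered @ top_k, eigvals[order]   # reduced sample set and sorted eigenvalues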
According to the embodiments of the disclosure, the marked feature values reflect the correlation between features, so the random forest model can be trained to recognize abnormal flow carrying a particular type of feature, further improving the training efficiency and recognition accuracy of the random forest model.
In step S105, the inputting the test set sample into the trained random forest model, comparing the feature output by the random forest model with the marking feature, and determining whether the trained random forest model meets the safety requirement according to the comparison result includes:
setting a threshold according to the safety requirement, and sequentially inputting test set samples into a random forest model;
if the consistency ratio of the features output by the random forest model and the marking features is smaller than a threshold value, the safety requirement is not met; and if the consistency ratio of the features output by the random forest model and the marking features is greater than a threshold value, meeting the safety requirement.
The security requirement includes the requirement for detecting abnormal traffic. It should be understood that the higher the threshold, the stricter the security requirement; conversely, the lower the threshold, the looser the security requirement.
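By way of a non-limiting sketch, the check might look as follows, reading the consistency ratio as the fraction of test samples whose predicted feature matches the marked feature and assuming a threshold of 0.95:

    # Illustrative sketch: judge whether the trained model meets the security requirement.
    def meets_security_requirement(model, X_test, y_test, threshold: float = 0.95) -> bool:
        predictions = model.predict(X_test)
        consistency_ratio = (predictions == y_test).mean()   # share of outputs matching the marks
        return consistency_ratio >= threshold                # below the threshold: retrain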
The random forest model can use CART classification trees, with the corresponding features output according to Gini values, as shown in formula (3).
Gini(p) = Σ_k p_k (1 - p_k)    (3)
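A minimal sketch of such a forest of Gini-based CART trees, using scikit-learn with illustrative hyper-parameters, is shown below:

    # Illustrative sketch: random forest of CART trees split on Gini impurity, cf. formula (3).
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(
        n_estimators=100,      # number of CART trees in the forest (assumed value)
        criterion="gini",      # split quality measured by Gini impurity
        max_features="sqrt",   # random feature subset considered at each split
        random_state=42,
    )
    model.fit(X_train, y_train)   # X_train, y_train: the (optionally PCA-reduced) training set from the earlier sketches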
According to the embodiment of the disclosure, the threshold is set according to the security requirement, and whether the random forest model meets the requirement is judged according to the threshold, so that the judgment standard of the random forest model can be adjusted according to the threshold, and the security of the firewall is regulated and controlled.
In step S105, whether the trained random forest model meets the security requirement is judged according to the comparison result. If it does, training of the random forest model is complete; if not, the random forest model is retrained until it meets the security requirement.
It will be appreciated that retraining the random forest model may begin from any of steps S101-S103, because the model may fail to meet the security requirement for any of the following reasons: 1) the collected original flow data is not representative; 2) the feature marks are wrong; 3) when the training set and test set were randomly divided, their sample features ended up differing greatly. The random forest model can therefore be retrained from whichever of steps S101-S103 matches the specific cause; if the specific cause of failing to meet the security requirement cannot be determined, execution can simply restart from step S101.
In some embodiments, further comprising a test set sample supplemental training set comprising:
inputting the test set sample into the trained random forest model, marking the corresponding flow data sample characteristics when the random forest model cannot output the characteristics, and storing the corresponding flow data sample and the characteristics into the training set.
When the random forest model cannot output features for a sample, the corresponding flow data sample features are evidently new features; storing these features in the training set expands the range of features the random forest model can identify.
According to the embodiments of the disclosure, storing in the training set the test set samples whose features cannot be output, together with their features, avoids the shortcoming that the random forest model learns too few feature samples when, during the random division of the training set and test set, flow data samples with similar features all happen to be assigned to the test set.
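As a non-limiting sketch of this supplementing step, "cannot output the features" is interpreted below as all class probabilities falling under an assumed confidence floor; label_manually stands in for the manual feature-marking step and is hypothetical.

    # Illustrative sketch: move unrecognised test samples (and their marked features)
    # into the training set. The 0.5 confidence floor and label_manually() are assumptions.
    import numpy as np

    def supplement_training_set(model, X_test, X_train, y_train, confidence_floor=0.5):
        proba = model.predict_proba(X_test)
        unknown = proba.max(axis=1) < confidence_floor   # samples the forest cannot classify
        X_unknown = X_test[unknown]
        y_unknown = label_manually(X_unknown)            # hypothetical manual feature marking
        return np.vstack([X_train, X_unknown]), np.concatenate([y_train, y_unknown])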
In this method, a new sample set formed through preprocessing is selected, the corresponding training set is fed into the random forest model for training, the random forest model is evaluated with the test set, and the parameters are continuously adjusted and optimized according to the prediction results until the best detection effect is achieved. Finally, a report generated from the abnormal flow data is fed back to the handling staff, and the newly added abnormal flow is added to the existing training set so that it can be quickly matched and handled next time.
Fig. 2 shows a flowchart of an anomaly flow detection method 200 based on a random forest model, according to an embodiment of the present disclosure.
As shown in fig. 2, the abnormal flow detection method 200 based on the random forest model includes:
s201, collecting original flow data;
s202, performing feature labeling on the original flow data sample to generate a training sample;
s203, randomly dividing the training sample into a training set and a testing set;
s204, preprocessing a training set sample according to the characteristics, and training a random forest model according to the preprocessed training set sample;
s205, inputting a test set sample into the trained random forest model, comparing the characteristics output by the random forest model with the marking characteristics, and judging whether the trained random forest model meets the safety requirement according to the comparison result;
s206, inputting the flow data into the random forest model obtained by training by the training method of the random forest model, and judging whether the flow is abnormal or not according to the output characteristics of the random forest model.
In some embodiments, further comprising updating the training set samples, comprising:
when the random forest model cannot output the characteristics, the corresponding flow data characteristics are marked, and the corresponding flow data and the characteristics are stored in the training set.
It can be appreciated that when the random forest model cannot output features, it is apparent that the corresponding flow data features are new features that have not been trained before, and the new features are stored in the training set to enable the random forest model to perform supplemental training according to the corresponding data flow features.
According to the embodiments of the disclosure, feature behaviors that have not appeared before can be identified by updating the training set samples from time to time; keeping the training set updated and stored lets the update speed of the random forest model keep pace with the rate at which attack techniques change, so that abnormal flow can be detected quickly and effectively.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required by the present disclosure.
The foregoing is a description of embodiments of the method, and the following further describes embodiments of the present disclosure through examples of apparatus.
Fig. 3 shows a block diagram of a training apparatus 300 of a random forest model according to an embodiment of the present disclosure.
As shown in fig. 3, the training apparatus 300 for random forest model includes:
an acquisition unit 301, configured to acquire raw flow data;
a marking unit 302, configured to perform feature marking on the original flow data sample, and generate a training sample;
a grouping unit 303, configured to randomly divide the training samples into a training set and a test set;
the training unit 304 is configured to pre-process the training set sample according to the features, and train the random forest model according to the pre-processed training set sample;
the test unit 305 is configured to input the test set sample into the trained random forest model, compare the feature output by the random forest model with the marking feature, and determine whether the trained random forest model meets the security requirement according to the comparison result.
In some embodiments, the method further includes a training set sample supplementing unit, configured to input the test set sample into the trained random forest model, mark the corresponding flow data sample feature when the random forest model cannot output the feature, and store the corresponding flow data sample and the feature into the training set.
Fig. 4 shows a block diagram of an anomaly flow detection device 400 based on a random forest model, according to an embodiment of the present disclosure.
As shown in fig. 4, the abnormal flow detection apparatus 400 based on the random forest model includes:
a model generating unit 401, configured to train to obtain a random forest model by using the training method 100 of the random forest model;
the judging unit 402 is configured to input the data traffic into the trained random forest model, and judge whether the data traffic is abnormal traffic according to the features output by the random forest model.
In some embodiments, the method further includes a training set sample updating unit, configured to mark corresponding flow data features when the random forest model cannot output the features, and store the corresponding flow data and features into the training set.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the described modules may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the related user personal information all conform to the regulations of related laws and regulations, and the public sequence is not violated.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 5 shows a schematic block diagram of an electronic device 500 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
The device 500 comprises a computing unit 501 that may perform various suitable actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 502 or loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, ROM 502, and RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Various components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the various methods and processes described above, such as method 100 or method 200. For example, in some embodiments, the method 100 or the method 200 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by computing unit 501, one or more steps of method 100 or method 200 described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the method 100 or the method 200 by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (8)

1. A method for training a random forest model, comprising:
collecting original flow data;
performing feature labeling on the original flow data sample to generate a training sample;
randomly dividing the training samples into a training set and a testing set;
preprocessing a training set sample according to the characteristics, and training a random forest model according to the preprocessed training set sample;
inputting the test set sample into a trained random forest model, comparing the characteristics output by the random forest model with the marking characteristics, and judging whether the trained random forest model meets the safety requirement according to the comparison result;
the preprocessing of the training set sample according to the characteristics comprises the following steps:
decentralizing the training set samples, calculating the correlation among the features by using a covariance matrix, and marking the feature values according to the correlation;
sorting according to the characteristic values, and selecting a plurality of samples as a new sample set.
2. A method of training a random forest model as claimed in claim 1 wherein said characterizing said raw flow data samples comprises:
the raw flow data samples are classified according to characteristics,
marking the original flow data sample as normal and/or abnormal according to the classification result;
processing the default values: when the number of default values is small and the influence degree is small, discarding the corresponding default values, and when the number of default values is large or the influence degree is large, assigning the average value of the features to the corresponding default values;
processing the abnormal value: processing is performed according to the difference between the abnormal value and the standard value.
3. The method for training a random forest model according to claim 1, wherein the step of inputting the test set sample into the trained random forest model, comparing the feature output by the random forest model with the signature feature, and determining whether the trained random forest model meets the safety requirement according to the comparison result comprises:
setting a threshold according to the safety requirement, and sequentially inputting test set samples into a random forest model;
if the consistency ratio of the features output by the random forest model and the marking features is smaller than a threshold value, the safety requirement is not met; and if the consistency ratio of the features output by the random forest model and the marking features is greater than a threshold value, meeting the safety requirement.
4. A method of training a random forest model as claimed in claim 1 further comprising supplementing the training set with test set samples, comprising:
inputting the test set sample into the trained random forest model, marking the corresponding flow data sample characteristics when the random forest model cannot output the characteristics, and storing the corresponding flow data sample and the characteristics into the training set.
5. The abnormal flow detection method based on the random forest model is characterized by comprising the following steps of:
inputting flow data into a random forest model obtained by training the random forest model according to any one of claims 1-4, and judging whether the flow is abnormal according to the output characteristics of the random forest model.
6. The method for detecting abnormal traffic based on a random forest model according to claim 5, further comprising updating training set samples, comprising:
when the random forest model cannot output the characteristics, the corresponding flow data characteristics are marked, and the corresponding flow data and the characteristics are stored in the training set.
7. A training device for a random forest model, comprising:
the acquisition unit is used for acquiring original flow data;
the marking unit is used for carrying out characteristic marking on the original flow data sample to generate a training sample;
the grouping unit is used for randomly dividing the training samples into a training set and a testing set;
the training unit is used for preprocessing the training set samples according to the characteristics and training the random forest model according to the preprocessed training set samples;
the test unit is used for inputting the test set sample into the trained random forest model, comparing the characteristics output by the random forest model with the marking characteristics, and judging whether the trained random forest model meets the safety requirement according to the comparison result;
the training unit is specifically used for: decentralizing the training set samples, calculating the correlation among the features by using a covariance matrix, and marking the feature values according to the correlation;
sorting according to the characteristic values, and selecting a plurality of samples as a new sample set.
8. An abnormal flow detection device based on a random forest model, which is characterized by comprising:
a model generating unit, configured to train to obtain a random forest model by using the training method of the random forest model according to any one of claims 1 to 4;
and the judging unit is used for inputting the data flow into the trained random forest model and judging whether the data flow is abnormal flow or not according to the characteristics output by the random forest model.
CN202210279285.1A 2022-03-21 2022-03-21 Training method of random forest model, abnormal flow detection method and device Active CN114553591B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210279285.1A CN114553591B (en) 2022-03-21 2022-03-21 Training method of random forest model, abnormal flow detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210279285.1A CN114553591B (en) 2022-03-21 2022-03-21 Training method of random forest model, abnormal flow detection method and device

Publications (2)

Publication Number Publication Date
CN114553591A CN114553591A (en) 2022-05-27
CN114553591B (en) 2024-02-02

Family

ID=81666505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210279285.1A Active CN114553591B (en) 2022-03-21 2022-03-21 Training method of random forest model, abnormal flow detection method and device

Country Status (1)

Country Link
CN (1) CN114553591B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114884843B (en) * 2022-06-10 2023-05-09 三峡大学 Flow monitoring system based on network audiovisual new media
CN115175191A (en) * 2022-06-28 2022-10-11 南京邮电大学 Mixed model abnormal flow detection system and method based on ELM and deep forest
CN116108880A (en) * 2023-04-12 2023-05-12 北京华云安信息技术有限公司 Training method of random forest model, malicious website detection method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160303A (en) * 2015-08-10 2015-12-16 上海闻泰电子科技有限公司 Fingerprint identification method based on mixed matching
CN114172748A (en) * 2022-02-10 2022-03-11 中国矿业大学(北京) Encrypted malicious traffic detection method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2434225A (en) * 2006-01-13 2007-07-18 Cytokinetics Inc Random forest modelling of cellular phenotypes
US10896385B2 (en) * 2017-07-27 2021-01-19 Logmein, Inc. Real time learning of text classification models for fast and efficient labeling of training data and customization

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160303A (en) * 2015-08-10 2015-12-16 上海闻泰电子科技有限公司 Fingerprint identification method based on mixed matching
CN114172748A (en) * 2022-02-10 2022-03-11 中国矿业大学(北京) Encrypted malicious traffic detection method

Also Published As

Publication number Publication date
CN114553591A (en) 2022-05-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant