CN117729137A - Feature generation method, device and equipment of network traffic data

Info

Publication number: CN117729137A
Application number: CN202410176841.1A
Authority: CN
Inventors: 周洪海; 金志浩; 谢丽萍; 赵玉薇
Original assignee: Jinshu Information Technology Suzhou Co ltd
Current assignee: Jinshu Information Technology Suzhou Co ltd
Priority date: 2024-02-08
Filing date: 2024-02-08
Publication date: 2024-03-19

Abstract

The invention provides a method, a device and equipment for generating features of network traffic data, aiming to improve the accuracy of network behavior analysis and the training efficiency of machine learning models. The method first obtains original network traffic data. The data is then preprocessed, including converting timestamps into relative time units, converting IP addresses and port numbers into numeric form, and encoding the protocol type and TCP flags, to form a two-dimensional data sequence. Next, the two-dimensional data sequence is scanned along the time axis using a sliding window, and statistics of the data points and an emotion score based on the TCP flags are calculated within each window. Finally, the new data points generated for each window, which contain the statistical and emotion information, are transversely spliced (concatenated column-wise) with the two-dimensional data sequence to form a comprehensive feature set. In this way, network traffic dynamics can be understood more comprehensively, and the quality of data analysis and model prediction is effectively improved.

Description

Feature generation method, device and equipment of network traffic data

Technical Field

The present invention relates to the field of computer networks, and in particular, to a method, an apparatus, and a device for generating characteristics of network traffic data.

Background

In the context of modern network technology, network traffic data analysis has become critical, especially in the fields of network security and performance monitoring. As network activity grows in volume and complexity, traditional network traffic analysis methods face increasing challenges. These challenges mainly include how to efficiently process and interpret the ever-increasing amount of data, and how to extract valuable information from such data. The prior art relies mainly on the basic collection and analysis of network traffic data, including statistical processing of information such as IP addresses, port numbers, protocol types, and packet sizes. While these methods are effective in addressing basic network problems, they struggle when analyzing complex and dynamically changing network environments.

One major limitation is that conventional approaches often lack a deep understanding of the underlying patterns and dynamic behavior of network traffic data. For example, in the field of security analysis, simple statistical analysis may not accurately identify complex network attack patterns, such as distributed denial of service (DDoS) attacks or advanced persistent threats (APT). In addition, because network traffic data is strongly time-dependent, conventional methods fall short in capturing how the data changes and trends over time.

Another limitation is that existing analysis tools often fail to take advantage of potential "emotion" information in network traffic data that may indicate changes in particular patterns of behavior or user activity in the network. As can be seen, the prior art has significant limitations in processing large-scale, complex and dynamically changing network traffic data, particularly in scenarios requiring real-time analysis and response.

Therefore, it is necessary to develop a new feature generation method of network traffic data.

Disclosure of Invention

The application provides a feature generation method of network traffic data, which is used for improving the accuracy of network behavior analysis.

The feature generation method of network traffic data comprises the following steps:

acquiring original network traffic data, wherein each piece of original network traffic data comprises a timestamp, a source IP address, a destination IP address, a source port, a destination port, a protocol type, a data packet size and a TCP flag;

preprocessing the original network traffic data to create a two-dimensional data sequence, wherein the two-dimensional data sequence is arranged according to a time sequence;

setting a sliding window for scanning the two-dimensional data sequence along a time axis; for the data points in the sliding window, respectively calculating statistical information of the data points and an emotion score based on the TCP flags, wherein the statistical information comprises the average value of the data packet sizes in the sliding window and the number of data points; the emotion score is calculated according to pre-allocated emotion values corresponding to the TCP flags;

generating, for each window, a new data point including the statistical information and the emotion score, thereby forming a new data sequence;

and transversely splicing the new data sequence with the two-dimensional data sequence to form a comprehensive feature set for data analysis of network behaviors and training of a machine learning model.

Still further, the preprocessing the original network traffic data includes:

converting the timestamp to a relative time unit, converting the IP addresses in the source IP address and the destination IP address to numerical representations, normalizing the port numbers in the source port and the destination port to integer form, converting the protocol type to integer coding, converting the data packet size to numerical representations, and converting the TCP flag to a set of binary value data.

Still further, the original network traffic data $D$ comprises $N$ pieces of network traffic data, $D = \{d_1, d_2, \ldots, d_N\}$; the $i$-th piece of network traffic data is expressed as:

$d_i = (t_i, \mathrm{sip}_i, \mathrm{dip}_i, \mathrm{sp}_i, \mathrm{dp}_i, p_i, s_i, f_i)$;

wherein $1 \le i \le N$; $t_i$ is the timestamp, $\mathrm{sip}_i$ is the source IP address, $\mathrm{dip}_i$ is the destination IP address, $\mathrm{sp}_i$ is the source port, $\mathrm{dp}_i$ is the destination port, $p_i$ is the protocol type, $s_i$ is the data packet size, and $f_i$ is the TCP flag.

Further, the window size $W$ of the sliding window is set according to the following formula:

$W = 2j + 1$;

wherein $j$ is the window adjustment factor, $1 \le j \le 10$.

Further, the sliding mechanism of the sliding window is as follows:

when (when)When the window range covers network traffic data +.>Data of->To->；

When (when)When the window range covers network traffic data +.>Data of->To->。

Further, the emotion value of the SYN flag bit in the TCP flag is set to 1, indicating a positive connection attempt; the emotion value of the ACK flag bit is set to 0, representing a neutral acknowledgement response; the emotion value of the FIN flag bit is set to -1, representing a slightly negative emotion; the emotion value of the RST flag bit is set to -2, representing a negative emotion; the emotion value of the PSH flag bit is set to 0.5, representing a slightly positive emotion; the emotion value of the URG flag bit is set to -0.5, representing a slightly negative emotion.

Still further, the feature generation method further includes:

establishing a machine learning model based on the comprehensive feature set, the model being trained to distinguish between normal and abnormal network traffic behaviors;

applying the machine learning model to new real-time sample data, and detecting whether abnormal network traffic behaviors exist;

When abnormal network traffic behavior is detected, a notification is sent to a network administrator or corresponding security system.

The application provides a characteristic generating device of network traffic data, comprising:

an acquisition unit, configured to acquire original network traffic data, wherein each piece of original network traffic data comprises a timestamp, a source IP address, a destination IP address, a source port, a destination port, a protocol type, a data packet size and a TCP flag;

the creating unit is used for preprocessing the original network traffic data so as to create a two-dimensional data sequence, wherein the two-dimensional data sequence is arranged according to a time sequence;

a scanning unit, configured to set a sliding window, where the sliding window is used to scan the two-dimensional data sequence along a time axis; respectively calculating statistical information of the data points and emotion scores based on TCP marks for the data points in the sliding window, wherein the statistical information comprises an average value of the data packet sizes in the sliding window and the number of the data points; the emotion score is calculated according to emotion values corresponding to the pre-allocated TCP marks;

a generation unit for generating a new data point comprising statistical information and emotion scores for each sliding window, thereby forming a new data sequence;

a splicing unit, configured to transversely splice the new data sequence with the two-dimensional data sequence to form a comprehensive feature set for data analysis of network behaviors and training of a machine learning model.

The application provides an electronic device, comprising:

a processor;

and a memory for storing a program which, when read and executed by the processor, performs the above-described feature generation method of network traffic data.

The present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the above-described method of feature generation of network traffic data.

The beneficial effects of the technical scheme that this application provided include:

(1) By using multiple dimensions, including the timestamp, IP addresses, port information, protocol type, data packet size and TCP flags, the key characteristics of network traffic can be comprehensively captured, providing a rich data basis for in-depth analysis.

(2) By scanning the data sequence along the time axis with a sliding window, the method can effectively capture the time-series characteristics of network traffic data, which is important for understanding network behavior patterns.

(3) An emotion scoring mechanism based on TCP flags is introduced, providing a new perspective for network traffic data analysis. This approach may reveal the behavioral motives and patterns behind traffic data, which is particularly valuable for network security analysis.

(4) By transversely splicing the newly generated data sequence with the original data sequence, the resulting comprehensive feature set reflects the characteristics of network traffic more comprehensively, improving the accuracy and efficiency of subsequent model training.

(5) Using the comprehensive feature set as input can improve the performance of machine learning models in network traffic behavior analysis, which has significant application value in fields such as anomaly detection and intrusion detection.

Drawings

Fig. 1 is a flowchart of a method for generating characteristics of network traffic data according to a first embodiment of the present application.

Fig. 2 is a schematic diagram of a feature generating device for network traffic data according to a second embodiment of the present application.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the application is therefore not limited to the specific embodiments disclosed below.

The first embodiment of the application provides a feature generation method of network traffic data. Fig. 1 shows a flowchart of the first embodiment of the present application. The feature generation method of network traffic data according to the first embodiment is described in detail below with reference to Fig. 1.

Step S101: original network traffic data is obtained, wherein each piece of original network traffic data comprises a timestamp, a source IP address, a destination IP address, a source port, a destination port, a protocol type, a data packet size and a TCP flag.

Step S101 involves a process of acquiring original network traffic data. This step is the basis for the overall feature generation method, which ensures the accuracy and validity of the subsequent analysis. Raw network traffic data refers to packet information transmitted over a network interface, which is typically recorded and provided by network devices (e.g., routers, switches, or servers).

Each piece of original network traffic data contains the following key information:

Timestamp: records the specific point in time at which the data packet was captured. The accuracy of the timestamps is critical to the subsequent analysis of the time-series nature of the network traffic data.

Source IP address and destination IP address: these addresses identify the sender and the receiver of the data packet. Information on IP addresses is necessary to analyze the source and destination of network traffic.

Source and destination ports: these port numbers provide information about the particular network service used by the packet. The port number helps identify the type of network traffic and the application.

Protocol type: the communication protocol (e.g., TCP, UDP, etc.) used by the data packets is identified, and the type of protocol is critical to understanding the nature of the network traffic.

Packet size: the size of the data packet (in bytes) provides information about the amount of transmission load.

TCP flag: these flags (e.g., SYN, ACK, FIN, etc.) describe the state of the TCP connection and the nature of the data packets, which are important to understand the context of network communications.

TCP (Transmission Control Protocol) is a commonly used network communication protocol whose flag bits are used to control and manage the network communication process. In network traffic data, the TCP flags provide important information about the state and behavior of a packet, which is critical to understanding the nature of the network traffic.

The TCP protocol defines a plurality of flags, each of which plays a specific role in network communications:

1. SYN (synchronize sequence numbers): used to initialize the sequence number when establishing a connection. The SYN flag is set at the beginning of the TCP three-way handshake, indicating that a new connection is being started.

2. ACK (acknowledgement): indicates that the recipient has successfully received data. During data transmission, the ACK flag is typically set, indicating that previously transmitted data has been acknowledged.

3. FIN (finish): used to release a connection. Setting the FIN flag indicates that the sender has no more data to transmit and is ready to end the connection.

4. RST (reset): used to abruptly terminate a connection. The RST flag is set when the connection is in error or a forced shutdown is required.

5. PSH (push): prompts the recipient to process the data immediately. Setting the PSH flag generally indicates that the data should be passed to the application as soon as possible, rather than queued in a buffer.

6. URG (urgent): indicates that the data packet contains urgent data. When the URG flag is set, the packet carries urgent data that should be processed with priority.
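The six flags listed above occupy fixed bit positions in the flags byte of the TCP header. The following is a minimal sketch of decoding that byte into one 0/1 value per flag. The function name `decode_tcp_flags` and the output column order are illustrative assumptions, not part of the described method; the bit masks are the standard TCP header values.

```python
# Standard bit masks of the six classic flags in the TCP header flags byte.
TCP_FLAG_BITS = {
    "FIN": 0x01,
    "SYN": 0x02,
    "RST": 0x04,
    "PSH": 0x08,
    "ACK": 0x10,
    "URG": 0x20,
}

# Assumed output order for this sketch (the column order used in Table 1).
FLAG_ORDER = ("SYN", "ACK", "RST", "FIN", "PSH", "URG")


def decode_tcp_flags(flags_byte: int) -> dict:
    """Return a {flag_name: 0 or 1} mapping for one packet's flags byte."""
    return {
        name: int(bool(flags_byte & TCP_FLAG_BITS[name]))
        for name in FLAG_ORDER
    }


# 0x12 has both the SYN (0x02) and ACK (0x10) bits set.
syn_ack = decode_tcp_flags(0x12)
```

A packet can carry several flags at once (for example SYN together with ACK in the second step of the handshake), which is why each flag is kept as its own 0/1 value rather than as a single category.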

The acquisition of raw network traffic data is typically performed by a network monitoring tool or specialized network analysis software. These tools are capable of capturing packets of data over the network interface in real time and recording the detailed information described above. In some embodiments, this information may also be extracted from a log file or data flow record of the network device. The acquired data should remain in the original format to ensure accuracy of the analysis.

The raw network traffic data obtained is a key input for analyzing network behavior and building machine learning models. The details and integrity of these data directly affect the quality and effect of feature generation in subsequent steps. Thus, step S101 involves not only the collection of data, but also ensuring the accuracy and availability of data.

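The original network traffic data $D$ can be expressed as:

$D = \{d_1, d_2, \ldots, d_N\}$;

wherein $N$ represents the total number of pieces of original network traffic data.

The $i$-th piece of network traffic data is expressed as:

$d_i = (t_i, \mathrm{sip}_i, \mathrm{dip}_i, \mathrm{sp}_i, \mathrm{dp}_i, p_i, s_i, f_i)$;

wherein $1 \le i \le N$; $t_i$ is the timestamp, $\mathrm{sip}_i$ is the source IP address, $\mathrm{dip}_i$ is the destination IP address, $\mathrm{sp}_i$ is the source port, $\mathrm{dp}_i$ is the destination port, $p_i$ is the protocol type, $s_i$ is the data packet size, and $f_i$ is the TCP flag.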

In summary, step S101 plays a crucial role in the overall feature generation method, and lays a solid foundation for subsequent data preprocessing, feature extraction and model training.
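As an illustration of the eight fields collected in step S101, one piece of original network traffic data could be held in a small record type. This is only a sketch: the class name `RawRecord`, the field names, and the sample values (including the documentation-range IP addresses) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawRecord:
    """One piece of original network traffic data (hypothetical field names)."""
    timestamp: float        # capture time, here assumed to be UNIX seconds
    src_ip: str             # source IP address in dotted-decimal form
    dst_ip: str             # destination IP address in dotted-decimal form
    src_port: int           # source port
    dst_port: int           # destination port
    protocol: str           # protocol type, e.g. "TCP" or "UDP"
    packet_size: int        # data packet size in bytes
    tcp_flags: frozenset    # names of the TCP flags that are set


# Illustrative values only; they are not taken from any real capture.
example = RawRecord(
    timestamp=1704067200.0,
    src_ip="192.0.2.10",
    dst_ip="198.51.100.7",
    src_port=51234,
    dst_port=443,
    protocol="TCP",
    packet_size=1500,
    tcp_flags=frozenset({"SYN"}),
)
```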

Step S102: the raw network traffic data is preprocessed to create a two-dimensional data sequence.

In this embodiment, the preprocessing includes converting the timestamp into a relative time unit, converting the IP addresses of the source IP address and the destination IP address into a numerical representation, normalizing the port numbers of the source port and the destination port into an integer form, converting the protocol type into an integer code, converting the packet size into a numerical representation, and converting the TCP flag into a set of binary values.

Step S102 is a key step that involves performing a series of preprocessing operations on the raw network traffic data acquired. The purpose of these preprocessing operations is to convert the raw data into a form more suitable for analysis and feature extraction.

The details of the pretreatment operation are as follows:

1. Conversion of the timestamp: in this step, the timestamp in the original data is converted into relative time units. This typically involves converting the timestamp from its original format (e.g., a UNIX timestamp) to a number of seconds or other units of time elapsed since some particular point in time (e.g., the data collection start time). This transformation facilitates time-series comparison and processing in subsequent analysis.

2. Numerical conversion of IP address: the source IP address and the destination IP address in the original data are converted into numeric representations. This typically involves converting an IP address in dotted-decimal format to integer form, making the IP address easier to use in subsequent calculations and analysis.

3. Standardization of port numbers: the source port and destination port numbers are normalized to integer form. Since the port number is originally an integer, this process may involve range checking and format unification to ensure consistency and accuracy of the data.

4. Integer encoding of protocol types: the network protocol type (e.g., TCP, UDP) is converted to integer encoding. Such encoding facilitates subsequent analysis using the protocol type as a numerical feature.

5. Numerical representation of packet size: the packet size (typically in bytes) is directly characterized as a numerical value.

6. Binary conversion of TCP flags: the TCP flags (e.g., SYN, ACK, FIN, etc.) are converted to a set of binary values. This allows the TCP flags to be used in computer processing and analysis, particularly in the subsequent emotion scoring.

Through the preprocessing step, the original network traffic data is converted into a two-dimensional data sequence. In this sequence, each row represents a data point (a packet or a group of packets) and each column represents a feature (e.g., time, IP address, port number, etc.). This two-dimensional format is the basis for data analysis and feature extraction in subsequent steps. It is noted that this two-dimensional data sequence may be constructed in time order.
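The six preprocessing operations can be sketched as a single function that turns raw records into time-ordered numeric rows. This is a minimal sketch under stated assumptions: the input is a list of dictionaries with hypothetical key names, the protocol codes reuse the IANA protocol numbers (the description does not say which integer coding is used), and the relative time is measured from the earliest record.

```python
import ipaddress

# Assumed integer coding of protocol types (IANA protocol numbers).
PROTOCOL_CODES = {"ICMP": 1, "TCP": 6, "UDP": 17}
# Assumed column order of the binary TCP flag values.
FLAG_ORDER = ("SYN", "ACK", "RST", "FIN", "PSH", "URG")


def preprocess(records):
    """Convert raw records (dicts) into a time-ordered two-dimensional list."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    if not ordered:
        return []
    t0 = ordered[0]["timestamp"]  # reference point for relative time
    rows = []
    for r in ordered:
        for port in (r["src_port"], r["dst_port"]):
            if not 0 <= int(port) <= 65535:  # range check on port numbers
                raise ValueError(f"port out of range: {port}")
        row = [
            r["timestamp"] - t0,                      # relative time
            int(ipaddress.ip_address(r["src_ip"])),   # IP address as integer
            int(ipaddress.ip_address(r["dst_ip"])),
            int(r["src_port"]),
            int(r["dst_port"]),
            PROTOCOL_CODES[r["protocol"].upper()],    # integer protocol code
            int(r["packet_size"]),
        ]
        # One 0/1 column per TCP flag.
        row.extend(1 if name in r["tcp_flags"] else 0 for name in FLAG_ORDER)
        rows.append(row)
    return rows


# Illustrative input; addresses and ports are made up for the example.
rows = preprocess([
    {"timestamp": 1704067205, "src_ip": "10.0.0.2", "dst_ip": "10.0.0.1",
     "src_port": 443, "dst_port": 51234, "protocol": "tcp",
     "packet_size": 40, "tcp_flags": {"ACK"}},
    {"timestamp": 1704067200, "src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
     "src_port": 51234, "dst_port": 443, "protocol": "TCP",
     "packet_size": 1500, "tcp_flags": {"SYN"}},
])
```

Each row then has 13 numeric columns (7 scalar features plus 6 flag bits), and the rows are sorted by time, which is the layout the sliding window in step S103 scans.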

The preprocessing step is critical to ensure consistency of the data and accuracy of the analysis. By converting the raw data into a standardized and unified format, data analysis and feature extraction can be performed more efficiently. In addition, preprocessing also lays a solid foundation for advanced analysis in subsequent steps, such as sliding window analysis and training of machine learning models.

In summary, step S102 is not only an important component of the data preparation phase, but is also critical to the successful implementation of the overall feature generation method. By accurately and efficiently preprocessing the original data, the step can provide high-quality data, and support is provided for deep analysis of network traffic and model training.

Step S103: setting a sliding window for scanning the two-dimensional data sequence along a time axis; respectively calculating statistical information of the data points and emotion scores based on TCP marks for the data points in each sliding window, wherein the statistical information comprises an average value of the sizes of the data packets in the windows and the number of the data points; and calculating the emotion scores according to emotion values corresponding to the pre-allocated TCP marks.

Step S103 is a key analysis process in this embodiment, which involves scanning the preprocessed two-dimensional data sequence along the time axis using a sliding window, and performing statistical analysis and emotion scoring of the data points within each window.

In this step, a sliding window needs to be set first. This window is used to capture successive data points over a time series, providing a view of the time context. The size of the window (i.e., the number of data points covered) is configurable and can be adjusted based on the characteristics of the network traffic data and the analysis requirements. The selection of an appropriate window size is critical to capturing dynamic changes in network behavior.

Once the window size is set, the window moves along the time series, progressively covering the data points of different time periods. This process is similar to moving a "magnifying glass" over the time series, observing and analyzing a set of consecutive data points at each position. The movement of the sliding window ensures that the entire time series is covered without missing any time point.

At each window position, a calculation of statistical information is performed on the data points within the window. This includes calculating an average of the packet sizes within a window, which helps to understand the overall trend of network traffic over that time window. In addition, the number of data points within the window is also calculated, which helps to quantify the density of network activity over a particular period of time.

In addition to the statistical information, the data points within each window are also emotion-scored based on the TCP flags. This scoring process assigns predefined emotion values to the different TCP flags in order to evaluate behavioral patterns and potential emotional tendencies in the network traffic data. For example, frequent SYN flags may indicate an increase in connection attempts, while a high percentage of RST flags may indicate network connection problems or attack activity.

Based on the statistics and emotion scores within each window, this step generates a new data point that integrates the statistics and the emotion score. The innovation of the method is that it considers the quantitative characteristics of network traffic and also integrates a behavior-based qualitative analysis, so that a richer and deeper feature set is provided for subsequent data analysis and model training.

Table 1. Example of a two-dimensional data sequence

Data point	Timestamp (s)	Packet size (bytes)	SYN	ACK	RST	FIN	PSH	URG
1	1704067200	1500	1	0	0	0	0	0
2	1704067205	40	0	1	0	0	0	0
3	1704067210	600	0	0	1	0	0	0
4	1704067215	1200	0	0	0	1	0	0
5	1704067220	75	0	0	0	0	1	0
6	1704067225	50	0	0	0	0	0	1

In connection with the two-dimensional data sequence example provided in Table 1 above, it is explained how statistics of data points and emotion scores based on TCP markers are calculated for data points within a sliding window, respectively.

Assume that the TCP flag bits and their emotion values are as follows:

SYN (establish connection): +1 (positive)

ACK (acknowledgement): 0 (neutral)

RST (reset connection): -2 (negative)

FIN (end connection): -1 (slightly negative)

PSH (push data): +0.5 (slightly positive)

URG (urgent data): -0.5 (slightly negative)

For each data point, the emotion values of all TCP flag bits that are set to 1 are summed to obtain the emotion score of that data point. The emotion scores of all data points in the window are then summed to obtain the emotion score of the window.
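The scoring rule just described can be sketched in a few lines. The emotion values follow this embodiment, with FIN taken as -1 as in the window calculations that follow; the function names and the tuple layout (SYN, ACK, RST, FIN, PSH, URG) are assumptions made for the example.

```python
# Emotion value per TCP flag, as assigned in this embodiment.
EMOTION_VALUES = {
    "SYN": 1.0, "ACK": 0.0, "RST": -2.0,
    "FIN": -1.0, "PSH": 0.5, "URG": -0.5,
}
# Assumed order of the 0/1 flag values within one data point.
FLAG_ORDER = ("SYN", "ACK", "RST", "FIN", "PSH", "URG")


def point_score(flag_bits):
    """Sum the emotion values of the flags that are set in one data point."""
    return sum(EMOTION_VALUES[name] * bit
               for name, bit in zip(FLAG_ORDER, flag_bits))


def window_score(points):
    """Sum the per-point scores of all data points inside one window."""
    return sum(point_score(bits) for bits in points)


# A SYN packet, an ACK packet and an RST packet: 1 + 0 - 2 = -1.
first_window = window_score([
    (1, 0, 0, 0, 0, 0),
    (0, 1, 0, 0, 0, 0),
    (0, 0, 1, 0, 0, 0),
])
```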

Assuming that the window size of the sliding window is fixed at 3, the first window position covers the 1 st to 3 rd data points:

number of data points: 3

Packet size average: (1500+40+600)/3= 713.33

TCP flag: SYN (1,0,0,0,0,0), ACK (0,1,0,0,0,0), RST (0,0,1,0,0,0)

Emotion score: SYN (+1) + ACK (0) + RST (-2) = -1

The second window position covers the 2 nd to 4 th data points:

number of data points: 3

Packet size average: (40+600+1200)/3= 613.33

TCP flag: ACK (0,1,0,0,0,0), RST (0,0,1,0,0,0), FIN (0,0,0,1,0,0)

Emotion score: ACK (0) + RST (-2) + FIN (-1) = -3

The third window position covers the 3 rd to 5 th data points:

number of data points: 3

Packet size average: (600+1200+75)/3=625

TCP flag: RST (0,0,1,0,0,0), FIN (0,0,0,1,0,0), PSH (0,0,0,0,1,0)

Emotion score: RST (-2) +FIN (-1) +PSH (+0.5) = -2.5

The fourth window position covers the 4 th to 6 th data points:

number of data points: 3

Packet size average: (1200+75+50)/3= 441.67

TCP flag: FIN (0,0,0,1,0,0), PSH (0,0,0,0,1,0), URG (0,0,0,0,0,1)

Emotion score: FIN (-1) + PSH (+0.5) + URG (-0.5) = -1

The fifth window position covers the 5 th to 6 th data points:

number of data points: 2

Packet size average: (75+50)/2=62.5

TCP flag: PSH (0,0,0,0,1,0), URG (0,0,0,0,0,1)

Emotion score: PSH (+0.5) + URG (-0.5) = 0

The sixth window position covers the 6 th data point:

number of data points: 1

Packet size average: 50/1 = 50

TCP flag: URG (0,0,0,0,0,1)

Emotion score: URG (-0.5) = -0.5
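The six window calculations above can be reproduced with a short script. This sketch assumes the scheme used in the worked example: a window of size 3 that starts at each data point in turn and is simply cut short when it runs past the end of the sequence. The function name `window_features` and the tuple layout are assumptions for illustration.

```python
# Rows copied from Table 1: (timestamp, packet_size, SYN, ACK, RST, FIN, PSH, URG).
ROWS = [
    (1704067200, 1500, 1, 0, 0, 0, 0, 0),
    (1704067205,   40, 0, 1, 0, 0, 0, 0),
    (1704067210,  600, 0, 0, 1, 0, 0, 0),
    (1704067215, 1200, 0, 0, 0, 1, 0, 0),
    (1704067220,   75, 0, 0, 0, 0, 1, 0),
    (1704067225,   50, 0, 0, 0, 0, 0, 1),
]
# Emotion values in the same column order: SYN, ACK, RST, FIN, PSH, URG.
EMOTION = (1.0, 0.0, -2.0, -1.0, 0.5, -0.5)


def window_features(rows, size):
    """Return (count, mean packet size, emotion score) for each window."""
    features = []
    for start in range(len(rows)):
        window = rows[start:start + size]  # truncated near the end
        count = len(window)
        mean_size = sum(r[1] for r in window) / count
        score = sum(value * bit
                    for r in window
                    for value, bit in zip(EMOTION, r[2:]))
        features.append((count, round(mean_size, 2), score))
    return features


features = window_features(ROWS, 3)
for index, feature in enumerate(features, start=1):
    print(index, feature)
```

The first tuple is `(3, 713.33, -1.0)` and the last is `(1, 50.0, -0.5)`, matching the first and sixth window positions worked through above.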

The results of the calculations in the six windows form a new data sequence. Transversely splicing the new data sequence with the two-dimensional data sequence in Table 1 gives the splicing result shown in Table 2, which can be used as a comprehensive feature set for data analysis of network behaviors and training of a machine learning model.

Table 2. Example of transversely splicing the new data sequence with the two-dimensional data sequence

Data point	Timestamp (s)	Packet size (bytes)	SYN	ACK	RST	FIN	PSH	URG	Number of data points in window	Average packet size in window	Emotion score
1	1704067200	1500	1	0	0	0	0	0	3	713.33	-1
2	1704067205	40	0	1	0	0	0	0	3	613.33	-3
3	1704067210	600	0	0	1	0	0	0	3	625	-2.5
4	1704067215	1200	0	0	0	1	0	0	3	441.67	-1
5	1704067220	75	0	0	0	0	1	0	2	62.5	0
6	1704067225	50	0	0	0	0	0	1	1	50	-0.5
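The transverse splicing shown in Table 2 amounts to appending the window features of each data point to the end of that data point's row. A minimal sketch follows; the function name `splice_rows` is hypothetical, and the check that both sequences have the same length is an assumption based on the example, where one window is generated per data point.

```python
def splice_rows(base_rows, new_rows):
    """Append each new data point's columns to the matching base row."""
    if len(base_rows) != len(new_rows):
        raise ValueError("both sequences need one row per data point")
    return [tuple(base) + tuple(new) for base, new in zip(base_rows, new_rows)]


# First and last data points from Table 1, with their window features.
base = [
    (1704067200, 1500, 1, 0, 0, 0, 0, 0),
    (1704067225, 50, 0, 0, 0, 0, 0, 1),
]
window_features = [
    (3, 713.33, -1.0),
    (1, 50.0, -0.5),
]
feature_set = splice_rows(base, window_features)
```

Each spliced row keeps the original columns unchanged and gains three new ones, so a model trained on the feature set sees both the per-packet values and their local window context.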

In this embodiment, the window size $W$ of the sliding window is set according to the following formula:

$W = 2j + 1$;

wherein $j$ is the window adjustment factor.

The sliding window is used to analyze network traffic time series data. The setting of the window size ($W$) directly affects the granularity of the data analysis and the accuracy of the results. Therefore, defining an appropriate window size is critical.

In the formula above, $j$ is an integer ranging from 1 to 10. The formula ensures that the window size is always odd, which is a common practice in many data analysis applications because it allows the window to have a well-defined center point.

The function of the window adjustment factor $j$ is to control the size of the sliding window. By changing $j$, the number of data points covered by the window can be dynamically adjusted to adapt to different data characteristics and analysis requirements. For example:

When $j$ is smaller (e.g., 1 or 2), the window is smaller, enabling rapid changes in the data to be captured.

When $j$ is larger (e.g., 9 or 10), the window is larger, which is suitable for observing long-term trends in the data.

In a particular implementation, the value of $j$ may be selected based on characteristics of the data such as its rate of change, noise level, and the desired temporal resolution. For example, for high-frequency network attack detection, a smaller window may be required to capture rapidly changing traffic patterns. Conversely, for long-term trend analysis of traffic, a larger window may be more appropriate.

The choice of window size has a significant impact on the results of the data analysis. Smaller windows may lead to high sensitivity but may also be accompanied by more false positives. Larger windows may provide smoother views of the data, but may miss some important detail changes.

The setting of the sliding window size is a key factor in achieving efficient and accurate data analysis. Through the formula and the window adjustment factor $j$, the embodiment provides a flexible method for adapting to different network traffic data characteristics and helps keep the analysis results accurate and reliable.
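As a small illustration of how the window adjustment factor controls the window size, the sketch below assumes the rule $W = 2j + 1$. That form is an assumption chosen because it yields an odd size for every integer $j$, as the description requires; the function name `window_size` is hypothetical.

```python
def window_size(j: int) -> int:
    """Window size for adjustment factor j, assuming W = 2j + 1."""
    if not isinstance(j, int) or not 1 <= j <= 10:
        raise ValueError("j must be an integer from 1 to 10")
    return 2 * j + 1


# Sizes for every allowed adjustment factor, smallest to largest.
sizes = [window_size(j) for j in range(1, 11)]
```

Under this assumption the smallest window covers 3 data points and the largest covers 21, and every size has a single center point.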

In this embodiment, the sliding mechanism of the sliding window is as follows:

The sliding window mechanism is used to analyze network traffic time series data. This mechanism allows the window to be slid along the data sequence, capturing the local nature of the data at different points in time.

The sliding window is made up of a series of consecutive data points, the size of the window being determined by the formula defined above. The windows are slid along the network traffic data sequence and the data within each window is analyzed.

The sliding window operates as follows, where $i$ is the index of the current data point $d_i$, $N$ is the total number of data points, and $j$ is the window adjustment factor:

1. When $i = 1$: the window starts at the first data point $d_1$ of the data sequence and extends forward to the $(1+j)$-th data point $d_{1+j}$. This is the initial position of the window at the beginning of the data sequence.

2. When $1 < i \le j$: as the window slides forward, it covers from the first data point $d_1$ to the $(i+j)$-th data point $d_{i+j}$. At this stage the window gradually expands to its maximum size.

3. When $j < i \le N - j$: the window is now in the middle part of the data sequence. The window covers from the $(i-j)$-th data point $d_{i-j}$ to the $(i+j)$-th data point $d_{i+j}$, maintaining a fixed size.

4. When $i > N - j$: the window reaches the end of the data sequence. It runs from the $(i-j)$-th data point $d_{i-j}$ until the last data point $d_N$ of the sequence.

This sliding mechanism allows the window to traverse the entire data sequence in a consistent and orderly fashion. At the beginning and end of the sequence, the window size may not be equal to the size in the middle part of the sequence, since the window size gradually adapts to the boundary conditions of the data sequence.
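The boundary behavior described above can be captured by clipping the window to the ends of the sequence. The sketch below assumes one reading of the four cases: the window is centered on data point $i$ and reaches $j$ points to each side. The function name `window_bounds` and the 1-based inclusive indexing are illustrative choices, not confirmed details of the method.

```python
def window_bounds(i: int, n: int, j: int) -> tuple:
    """Return the 1-based inclusive (first, last) indices of the window.

    Assumes a window centered on data point i with j points on each side,
    clipped so that it never extends before point 1 or after point n.
    """
    if not 1 <= i <= n:
        raise ValueError("i must lie between 1 and n")
    return max(1, i - j), min(n, i + j)


# With n = 10 data points and j = 2 (full window size 5):
start_case = window_bounds(1, 10, 2)    # beginning of the sequence
grow_case = window_bounds(2, 10, 2)     # window still expanding
middle_case = window_bounds(5, 10, 2)   # fixed-size window in the middle
end_case = window_bounds(10, 10, 2)     # window shrinking at the end
```

Using `max` and `min` covers all four cases with one expression, so the beginning, middle and end of the sequence need no separate branches.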

By continuously sliding the windows and analyzing the data within each window, the present embodiment is able to continuously monitor changes in network traffic, capturing possible trends and patterns, which are critical for applications such as anomaly detection, traffic prediction, and the like.

The sliding window mechanism is one of the core parts of this embodiment. It not only provides an efficient method to continuously analyze network traffic data, but also ensures a high degree of adaptability and flexibility of the analysis process.

In this embodiment, the emotion value of the SYN flag bit in the TCP flag is set to 1, indicating a positive connection attempt; the emotion value of the ACK flag bit is set to 0, representing a neutral acknowledgement; the emotion value of the FIN flag bit is set to -1, a slightly negative emotion; the emotion value of the RST flag bit is set to -2, a negative emotion; the emotion value of the PSH flag bit is set to 0.5, a slightly positive emotion; and the emotion value of the URG flag bit is set to -0.5, a slightly negative emotion.

In this embodiment, the emotion value setting of the TCP flag is critical to understanding and interpreting the network traffic data. Each TCP flag bit is assigned an emotion value that reflects the emotion or meaning of a particular event in the network communication.

According to the different characteristics of network traffic, this embodiment assigns a specific emotion value to each TCP flag bit, as follows:

1. SYN flag bit (emotion value: +1):

- Stands for "establish connection".

- An emotion value of +1 indicates a positive connection attempt and marks the start of network communication.

2. ACK flag bit (emotion value: 0):

- Stands for "acknowledgement".

- An emotion value of 0 represents a neutral acknowledgement, a common interaction in network communication.

3. FIN flag bit (emotion value: -1):

- Stands for "end connection".

- An emotion value of -1 represents a slightly negative emotion, usually meaning the normal end of a communication.

4. RST flag bit (emotion value: -2):

- Stands for "reset connection".

- An emotion value of -2 represents a moderately negative emotion, usually associated with a connection error or abnormality.

5. PSH flag bit (emotion value: +0.5):

- Stands for "push data".

- An emotion value of +0.5 represents a slightly positive emotion and is a sign of data transmission.

6. URG flag bit (emotion value: -0.5):

- Stands for "urgent data".

- An emotion value of -0.5 represents a slightly negative emotion, typically used to indicate an urgent situation.

When these emotion values are used to analyze network traffic data, the emotion score for each TCP packet is the sum of the emotion values of the respective flag bits whose value is 1. This approach allows for the quantitative assessment of emotional propensity of network traffic to identify normal, abnormal, or potentially threatening behavior.
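The per-packet scoring rule above can be sketched as follows. The emotion values are those listed in this embodiment; the dictionary-based representation of the flag bits and the function name are assumptions made for illustration.

```python
from typing import Dict

# Emotion values assigned to each TCP flag bit in this embodiment.
EMOTION_VALUES: Dict[str, float] = {
    "SYN": 1.0,
    "ACK": 0.0,
    "FIN": -1.0,
    "RST": -2.0,
    "PSH": 0.5,
    "URG": -0.5,
}


def packet_emotion_score(flags: Dict[str, int]) -> float:
    """Sum the emotion values of the flag bits that are set to 1.

    `flags` maps a flag name to 0 or 1; this encoding is a hypothetical
    choice for the sketch, and missing flags are treated as 0.
    """
    return sum(
        (value for name, value in EMOTION_VALUES.items() if flags.get(name, 0) == 1),
        0.0,
    )
```

For example, a SYN+ACK packet scores 1.0, while an RST+ACK packet scores -2.0.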

By applying these emotion values to each TCP packet in the network traffic, the present embodiment can continuously and accurately monitor the emotion trend of the network traffic. This analysis method provides a new perspective to understand and interpret network activities, and is of great value in particular in the field of network security and traffic management.

The method not only improves the dimension and depth of network traffic analysis, but also provides a brand new angle for understanding network behavior patterns.

In summary, step S103 plays a crucial role in the present embodiment. Through time sequence scanning of the sliding window, by combining calculation of statistical information and emotion scoring based on TCP marks, the step can capture quantitative characteristics of data and can further analyze the behavior mode of network traffic. The method provides a brand new view angle for the characteristic generation of the network flow, is helpful for more accurately analyzing the network behavior and improving the training effect of the machine learning model in the aspect of network flow data analysis.

Step S104: a new data point is generated for each window that includes statistics and emotion scores, thereby forming a new data sequence.

Step S104 is a core element in this embodiment, and involves generating new data points according to the analysis result in each window. This step plays a key role in the data conversion and feature extraction process.

Within each window, statistical information calculation of data points and emotion scoring based on TCP markers have been completed. The purpose of step S104 is to integrate the analysis results to form a new, integrated data point. Each new data point contains the following components:

1. Statistical information: the average packet size and the number of data points within the window. This information provides a quantitative description of the network traffic within the window.

2. Emotion score: derived from the TCP flags in the window, it reflects the behavior pattern and potential emotional tendency of the network traffic.

These new data points not only contain key features of the original data, but also incorporate a deep understanding of the network traffic behavior.

The process of generating new data points involves the following steps:

1. Data aggregation: the data within each window is aggregated, including calculating the statistics and the emotion score.

2. Data synthesis: the statistical information and the emotion score are combined to form a new multi-dimensional data point.

3. Data normalization: to ensure consistency of the data points in subsequent analysis, normalization of the new data points may be required.
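The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the window-level emotion score is taken to be the sum of the per-packet scores (the aggregation rule is not spelled out in this passage), min-max scaling stands in for the unspecified normalization, and the `Packet` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Packet:
    size: float     # packet size
    emotion: float  # per-packet emotion score derived from the TCP flags


def window_feature(window: Sequence[Packet]) -> Tuple[float, int, float]:
    """Aggregate one window into (mean packet size, count, emotion score)."""
    count = len(window)
    mean_size = sum(p.size for p in window) / count if count else 0.0
    # Assumption: the window score is the sum of the per-packet scores.
    emotion = sum(p.emotion for p in window)
    return mean_size, count, emotion


def min_max_normalize(column: Sequence[float]) -> List[float]:
    """Scale one feature column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```

Applying `window_feature` to every window yields the new data sequence; `min_max_normalize` would then be applied column by column if normalization is needed.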

The new data points generated provide a rich basis for feature extraction of network traffic data. By combining the quantified statistics with a qualitative emotion score, these data points can more fully characterize network traffic. In machine learning and data analysis applications, these new data points can provide deeper insights that help identify and understand complex network behavior patterns.

Step S104 plays a critical role in the present embodiment. The method not only marks the conversion from the original data to the characteristic data, but also introduces an innovative analysis dimension, and provides strong support for the deep analysis of network traffic and the training of a machine learning model by integrating data analysis results of different types. The implementation of the step ensures the richness of data and the comprehensiveness of analysis, and is a key link for realizing efficient and accurate network flow analysis.

Step S105: the new data sequence is transversely spliced with the two-dimensional data sequence to form a comprehensive feature set for data analysis of network behaviors and training of a machine learning model.

Step S105 is the last key step of the feature generation method provided in this embodiment, and involves laterally splicing the newly generated data sequence with the preprocessed original two-dimensional data sequence, so as to form a comprehensive feature set. This step is critical to the final data analysis and machine learning model training.

The transverse splicing process is as follows:

1. Data preparation: before performing the transverse splicing, it is ensured that the newly generated data sequence and the original two-dimensional data sequence are compatible in format and dimension. This may involve further formatting of the data to ensure that the data is aligned.

2. Data merging: the new data sequence is combined with the original two-dimensional data sequence in the column direction. In the merging process, the new features of each data point (i.e., the statistical information and emotion scores generated in step S104) are added next to the original data point, thereby expanding the original feature set.

3. Result inspection: after merging, the resulting data set is checked to ensure the consistency and integrity of the data. This step verifies that the data was merged correctly and confirms that there is no data loss or format error.
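The splicing process above can be sketched with plain Python lists. This is an illustrative sketch only: it assumes one new data point per original row, and the function name `splice_columns` is hypothetical.

```python
from typing import List, Sequence

Row = Sequence[float]


def splice_columns(original: Sequence[Row], new: Sequence[Row]) -> List[List[float]]:
    """Append the new feature columns to the right of the original columns."""
    # Data preparation: both sequences must contain the same number of rows.
    if len(original) != len(new):
        raise ValueError("row counts differ; the sequences are not aligned")
    # Data merging: concatenate row by row in the column direction.
    merged = [list(o) + list(f) for o, f in zip(original, new)]
    # Result inspection: every merged row must have the same width.
    if len({len(row) for row in merged}) > 1:
        raise ValueError("merged rows have inconsistent widths")
    return merged
```

With an array library the same operation is a single horizontal concatenation; the explicit checks are kept here to mirror the preparation and inspection steps.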

The comprehensive feature set generated in step S105 merges the basic attributes of the original network traffic data with the new features obtained by advanced analysis. The advantages of this comprehensive feature set are:

1. Enhanced feature representation: by combining the basic network traffic information with the behavior patterns extracted from the data, the comprehensive feature set provides a more comprehensive view for network behavior analysis.

2. Improved model training: when training a machine learning model, this rich feature set can help the model better understand and predict network behavior, thereby improving the performance of the model.

3. Flexibility and scalability: the method has high flexibility and expandability in processing network traffic data of different types and scales.

Step S105 not only marks the completion of the process from data preprocessing to feature extraction, but also lays a foundation for the final data analysis and machine learning model training. By combining the new data points with the original data, the present embodiment provides a more comprehensive and in-depth feature set that facilitates more accurate analysis and prediction of network behavior.

In this embodiment, the feature generating method further includes:

The present embodiment applies machine learning techniques to the analysis of network traffic data to distinguish between normal and abnormal network behavior. The core of the method is a novel machine learning model which is specially designed and trained for network traffic characteristic data.

Building a machine learning model:

1. Model design:

A hybrid model based on an autoencoder and a decision tree ensemble (e.g., a random forest) is used.

The autoencoder is used to reduce the feature dimension and extract key features from the comprehensive feature set.

The decision tree ensemble is used to classify normal and abnormal traffic.

2. Model training:

Training is performed using historical network traffic data, including labeled normal and abnormal traffic samples.

The model parameters are adjusted iteratively to optimize the classification accuracy of the model.

3. Model testing and validation:

The model is tested under different network environments and traffic configurations.

The accuracy and robustness of the model are assessed using cross-validation and other statistical methods.
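The text does not state which cross-validation scheme is used. As one possible illustration, the sketch below splits sample indices into k contiguous folds without shuffling; the function name `k_fold_indices` and the choice of contiguous folds are assumptions.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

Each fold serves once as the held-out test set while the model is trained on the remaining folds; averaging the per-fold accuracy gives one estimate of how robust the classifier is.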

The detection of abnormal network traffic includes:

1. Real-time data processing:

The new real-time sample data is first subjected to the same preprocessing steps as the original data.

The processed data is input into a trained machine learning model.

2. Abnormality detection:

The model analyzes the data and identifies samples that do not conform to the normal flow pattern.

The model marks the abnormal samples based on the learned features and behavior patterns.

3. Notification mechanism:

When the model detects abnormal behavior, the system automatically alerts a network administrator or security system.

Alarms include detailed information of abnormal behavior such as time of occurrence, network parameters involved, etc.
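The detection flow above can be sketched as a small loop. The preprocessing function, the classifier, and the notification channel are passed in as callables because the text does not fix their interfaces; `Alert`, `detect_anomalies`, and the `"timestamp"` key are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Sample = Dict[str, float]


@dataclass
class Alert:
    timestamp: float             # time of occurrence
    details: Dict[str, float]    # network parameters involved


def detect_anomalies(
    samples: Sequence[Sample],
    to_features: Callable[[Sample], List[float]],
    classify: Callable[[List[float]], str],
    notify: Callable[[Alert], None],
) -> List[Alert]:
    """Preprocess each sample, classify it, and raise an alert if abnormal."""
    alerts = []
    for sample in samples:
        features = to_features(sample)  # same preprocessing as in training
        if classify(features) == "abnormal":
            alert = Alert(timestamp=sample["timestamp"], details=dict(sample))
            notify(alert)               # e.g. forward to an administrator
            alerts.append(alert)
    return alerts
```

In practice `classify` would wrap the trained hybrid model and `notify` would wrap whatever alerting channel the deployment uses.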

The hybrid model is described in detail below:

The hybrid model combines an autoencoder with a decision tree ensemble (e.g., a random forest) to efficiently process and analyze network traffic data and distinguish normal from abnormal behavior. The composition of the model, its inputs and outputs, the connections between its parts, and the implementation of each part are described in detail below.

The model composition and workflow include:

1. Model composition:

Autoencoder section: used for feature dimension reduction and key feature extraction.

Decision tree ensemble (e.g., random forest): used for the classification task.

2. Workflow:

Input: the comprehensive feature set of the network traffic data.

Autoencoder processing: the comprehensive feature set is input into the autoencoder for feature dimension reduction.

Decision tree ensemble processing: the dimension-reduced features are input into the decision tree ensemble for classification.

Output: the classification result, identifying normal or abnormal traffic.

The autoencoder section is described below:

1. Input: the comprehensive feature set of the network traffic, including the timestamp, IP addresses, port information, protocol type, packet size, TCP flag, etc.

2. Implementation of the autoencoder section:

A neural network is used to construct the autoencoder. The encoder part reduces the feature dimension layer by layer. The decoder part reconstructs the input data, which is used for error calculation during training.

3. Output: the dimension-reduced representation of the key features.

The decision tree ensemble section is described below:

1. Input: the dimension-reduced features output by the encoder part of the autoencoder.

2. Implementation:

The decision tree ensemble is constructed using a random forest algorithm. During training, the random forest uses multiple decision trees to train on and predict the samples. Diversity among the decision trees is achieved by randomly selecting features and data samples.

3. Output: the classification result, indicating whether each sample is normal or abnormal.
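The two stages described above can be summarized in a short worked formulation. The notation is introduced here for illustration and does not come from the source: $x$ is one row of the comprehensive feature set, $z$ its reduced representation, and $T_1, \ldots, T_M$ the trees of the forest. The squared-error reconstruction loss and the majority vote are common choices assumed for this sketch; the text specifies neither.

```latex
\begin{aligned}
z &= f_{\mathrm{enc}}(x), \qquad \hat{x} = f_{\mathrm{dec}}(z), \\
\mathcal{L}_{\mathrm{rec}}(x) &= \lVert x - \hat{x} \rVert_2^2, \\
\hat{y} &= \operatorname{mode}\bigl\{\, T_1(z), \ldots, T_M(z) \,\bigr\},
\qquad \hat{y} \in \{\text{normal}, \text{abnormal}\}.
\end{aligned}
```

Under these assumptions the autoencoder is trained to minimize $\mathcal{L}_{\mathrm{rec}}$ over the training set, after which only the encoder output $z$ is passed to the random forest.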

The hybrid model provides a method for efficiently processing and analyzing network traffic data through the combination of an autoencoder and a random forest. Combining the feature dimension reduction capability of the autoencoder with the classification capability of the random forest makes the model effective and accurate in processing large-scale, high-dimensional network traffic data. The design of the hybrid model provides a new technical means for the network security field and can be implemented and applied by those skilled in the art.

In the foregoing embodiment, a method for generating characteristics of network traffic data is provided; correspondingly, the application further provides a device for generating characteristics of network traffic data. Refer to Fig. 2, which is a schematic diagram of an embodiment of a feature generating device for network traffic data according to the present application. Since this embodiment, i.e. the second embodiment, is substantially similar to the method embodiment, it is described relatively briefly; for relevant points, refer to the description of the method embodiment. The device embodiments described below are merely illustrative.

The feature generating device for network traffic data provided in the second embodiment of the present application includes:

an obtaining unit 201, configured to obtain original network traffic data, where each piece of original network traffic data includes a timestamp, a source IP address, a destination IP address, a source port, a destination port, a protocol type, a packet size, and a TCP flag;

a creating unit 202, configured to pre-process the original network traffic data, thereby creating a two-dimensional data sequence, where the two-dimensional data sequence is arranged in a time sequence;

a scanning unit 203 for setting a sliding window for scanning the two-dimensional data sequence along a time axis; respectively calculating statistical information of the data points and emotion scores based on TCP marks for the data points in the sliding window, wherein the statistical information comprises an average value of the data packet sizes in the window and the number of the data points; the emotion score is calculated according to emotion values corresponding to the pre-allocated TCP marks;

a generating unit 204, configured to generate a new data point including statistical information and emotion scores for each window, thereby forming a new data sequence;

a splicing unit 205, configured to transversely splice the new data sequence with the two-dimensional data sequence to form a comprehensive feature set for data analysis of network behaviors and training of a machine learning model.

A third embodiment of the present application provides an electronic device, including:

a processor;

and a memory for storing a program which, when read and executed by the processor, performs the feature generation method of network traffic data provided in the first embodiment of the present application.

A fourth embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the feature generation method of network traffic data provided in the first embodiment of the present application.

While the preferred embodiment has been described, it is not intended to limit the invention thereto, and any person skilled in the art may make variations and modifications without departing from the spirit and scope of the present invention, so that the scope of the present invention shall be defined by the claims of the present application.

Claims

1. A method for generating characteristics of network traffic data, comprising:

generating a new data point for each sliding window including statistics and emotion scores, thereby forming a new data sequence;

2. The feature generation method according to claim 1, wherein the preprocessing the raw network traffic data includes:

3. The feature generation method of claim 1, wherein the raw network traffic data […] comprises […] pieces of network traffic data, […]; the […]-th piece of network traffic data […] is represented as […]; wherein […]; and the elements of each piece are, respectively, the timestamp, the source IP address, the destination IP address, the source port, the destination port, the protocol type, the packet size, and the TCP flag.

4. The feature generation method according to claim 3, wherein the window size of the sliding window […] is set according to the following formula:

[…]

wherein […] is the window adjustment factor, and […].

5. The feature generation method according to claim 4, wherein a sliding mechanism of the sliding window is as follows:

when […], the window covers the network traffic data from […] to […];

6. The feature generation method according to claim 5, wherein the emotion value of the SYN flag bit in the TCP flag is set to 1, indicating a positive connection attempt; the emotion value of the ACK flag bit is set to 0, representing a neutral acknowledgement; the emotion value of the FIN flag bit is set to -1, a slightly negative emotion; the emotion value of the RST flag bit is set to -2, a negative emotion; the emotion value of the PSH flag bit is set to 0.5, a slightly positive emotion; and the emotion value of the URG flag bit is set to -0.5, a slightly negative emotion.

7. The feature generation method according to claim 1, characterized by further comprising:

establishing a machine learning model based on the comprehensive feature set, the model being trained to distinguish between normal network traffic behavior and abnormal network traffic behavior;

8. A feature generation apparatus for network traffic data, comprising:

a scanning unit, configured to set a sliding window, where the sliding window is used to scan the two-dimensional data sequence along a time axis; respectively calculating statistical information of the data points and emotion scores based on TCP marks for the data points in the sliding window, wherein the statistical information comprises an average value of the data packet sizes in the window and the number of the data points; the emotion score is calculated according to emotion values corresponding to the pre-allocated TCP marks;

9. A feature generation apparatus of network traffic data, comprising:

a processor;

a memory for storing a program which, when read by the processor, performs the feature generation method of network traffic data provided in any one of claims 1 to 7.

10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, performs the method of generating characteristics of network traffic data as provided in any one of claims 1-7.