CN114254691A

CN114254691A - Multi-channel operation risk control method based on active identification and intelligent monitoring

Info

Publication number: CN114254691A
Application number: CN202111303151.0A
Authority: CN
Inventors: 曹世龙; 蔡颖凯; 王一哲; 付瀚臣; 刘鑫; 穆蓉; 许晶晶; 韩昕檀; 赵千乔
Original assignee: Marketing Service Center Of State Grid Liaoning Electric Power Co ltd; State Grid Corp of China SGCC
Current assignee: Marketing Service Center Of State Grid Liaoning Electric Power Co ltd; State Grid Corp of China SGCC
Priority date: 2021-11-05
Filing date: 2021-11-05
Publication date: 2022-03-29

Abstract

The invention provides a multi-channel operation risk control method based on active identification and intelligent monitoring, in which a computer constructs two modules: an undersampling module and an intelligent analysis module. The undersampling module samples and checks the input data and judges whether a data anomaly exists. When an anomaly is found, the data are passed to the intelligent analysis module for verification, and the intelligent analysis module rechecks the input data. With this method, manual work and automated analysis are combined in a balanced and efficient processing mechanism. By monitoring both the network traffic and its main traffic features, the method searches the data several times without greatly increasing hardware resources or consumption, and then judges suspected abnormal behaviors and risks, thereby achieving a better risk control effect.

Description

Multi-channel operation risk control method based on active identification and intelligent monitoring

Technical Field

The invention relates to the field of network security, and in particular to a multi-channel operation risk control method based on active identification and intelligent monitoring. The method targets the falsification of network data, including but not limited to malicious traffic promotion, black-market gang activity, activity cheating, and malicious ranking manipulation.

Background

With the rapid development of the network, malicious modification of, attacks on, and tampering with the information that Internet users browse have become increasingly diverse. Typical examples include malicious traffic promotion, black-market gang activity, activity cheating, and malicious ranking manipulation.

The attackers hide behind the network and exploit its vulnerabilities through cheating, rule-violating, or even illegal means: they inflate data within a short time, steal users' network information, install malware, or send false shopping, prize-winning, or recruitment information.

As network equipment becomes more complex, network bandwidth grows, and network tools multiply, the risk on the network increases day by day. Its concealed, deceptive, and technical nature makes risk control difficult, and manual, automated, and combined prevention and control all face great pressure.

Nevertheless, network security and risk control are an unavoidable practical problem. Upholding the security order of the Internet, improving the quality of network information, and detecting false information within large volumes of network resources are of great practical significance.

Disclosure of Invention

In view of these problems, the invention provides a multi-channel operation risk control method based on active identification and intelligent monitoring.

With this method, manual work and automated analysis are combined in a balanced and efficient processing mechanism.

By monitoring both the network traffic and its main traffic features, the method searches the data several times without greatly increasing hardware resources or consumption, and then judges suspected abnormal behaviors and risks, thereby achieving a better risk control effect.

The invention specifically comprises the following steps:

In the multi-channel operation risk control method based on active identification and intelligent monitoring, a computer constructs two modules: an undersampling module and an intelligent analysis module. The undersampling module samples and checks the input data and judges whether a data anomaly exists. When an anomaly is found, the data are passed to the intelligent analysis module for verification, and the intelligent analysis module rechecks the input data.

The undersampling module makes its judgement on the distance to the specified nearest-neighbor sample points, while addressing the poor results caused by imbalanced samples and overcoming the strong influence of noise points. The undersampling module comprises four submodules: a network traffic data acquisition and preprocessing submodule, a supervised-learning-based classification model building submodule, a KNN-based semi-supervised learning label correction submodule, and a model updating submodule.

After the intelligent analysis module is activated by the undersampling module, it verifies the data according to the following steps:

Step 1: Data set balancing. Suppose the data set contains k1 samples of category C1 and k2 samples of category C2. Each sample in the data set is a d-dimensional vector. The data set is balanced with the k-means algorithm, which yields k clusters.

The closer a sample lies to a cluster center, the better it represents the characteristics of that cluster. Suppose cluster Ci contains ni samples, where i = 1, 2, ..., k.

From each cluster Ci, the ni × k2/k1 samples nearest to the cluster center wi are selected, in proportion to the cluster size. In total, N samples of category C1 are obtained, so that the number of C1 samples is balanced with the number of C2 samples.
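The balancing in step 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: it assumes that k-means is run on the C1 (majority) samples only, that plain Lloyd iterations with a fixed seed are adequate, and that each quota ni × k2/k1 is rounded to the nearest integer. The function names `kmeans` and `undersample_majority` are hypothetical.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd k-means on a list of tuples; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, clusters


def undersample_majority(majority, minority_count, k, seed=0):
    """Keep, per cluster C_i, the round(n_i * k2 / k1) samples nearest w_i."""
    k1 = len(majority)
    centers, clusters = kmeans(majority, k, seed=seed)
    kept = []
    for center, members in zip(centers, clusters):
        quota = round(len(members) * minority_count / k1)  # assumption: rounding
        ranked = sorted(members, key=lambda p: math.dist(p, center))
        kept.extend(ranked[:quota])
    return kept
```

Because the cluster sizes ni sum to k1, the quotas sum to roughly k2, which is why the retained C1 samples end up balanced against the C2 samples.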

Step 2: Feature selection. A feature subset is first generated from the original feature set and evaluated with an evaluation function; the evaluation result is then compared with a stopping criterion. If the criterion is met, the feature subset is output and verified; otherwise, the next feature subset is generated and evaluated. The evaluation function used in this step is the information gain of a feature:

$$IG(A) = H(C) - H(C \mid A) = -\sum_{i=1}^{n} P(C_i)\log P(C_i) + \sum_{j=1}^{m} P(A = A_j)\sum_{i=1}^{n} P(C_i \mid A = A_j)\log P(C_i \mid A = A_j)$$

where A is a feature, C is the category, H(C) is the information entropy of the whole classification system, n is the number of categories of the classification system, P(C_i) is the proportion of samples whose category is C_i, m is the number of values of feature A, P(A = A_j) is the proportion of samples for which feature A takes the value A_j, and P(C_i | A = A_j) is the probability of belonging to category C_i given that feature A takes the value A_j.
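The symbols defined for step 2 (H(C), P(C_i), P(C_i | A = A_j)) describe an information-gain style evaluation of a feature. The following is a minimal sketch of such an evaluation function, assuming base-2 logarithms and discrete feature values; the function names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(C), in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(C) minus the entropy of C conditioned on the feature's value."""
    total = len(labels)
    by_value = {}
    for value, label in zip(feature_values, labels):
        by_value.setdefault(value, []).append(label)
    # sum_j P(A = A_j) * H(C | A = A_j)
    conditional = sum(len(subset) / total * entropy(subset)
                      for subset in by_value.values())
    return entropy(labels) - conditional
```

A feature that separates the categories perfectly scores H(C); a feature that is independent of the category scores zero, so ranking features by this value is one way to drive the subset search described in the step.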

Step 3: Discretization. Given a set S, a feature A and a breakpoint T, the breakpoint T divides the value set S of feature A into two subsets S₁ and S₂, where A ≤ T in S₁ and A > T in S₂. The weighted information entropy of feature A, Entropy(A, T; S), gives the information entropy of the set S after it is divided by the breakpoint T:

$$\mathrm{Entropy}(A, T; S) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$$

Feature A is then discretized recursively within the subsets S₁ and S₂ until the information gain

$$IG(A, T; S) = \mathrm{Entropy}(S) - \mathrm{Entropy}(A, T; S)$$

is less than the threshold δ. The threshold is computed from N, the number of samples in the set S; k, the number of categories contained in the set S; and k_i, the number of categories contained in the subset S_i.
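The recursive discretization of step 3 can be sketched as below. The text does not state how the threshold δ is computed; this sketch assumes the Fayyad-Irani minimum description length criterion, which is built from the same quantities N, k and k_i, and that choice is an assumption rather than something the source confirms. The names `best_breakpoint`, `mdl_threshold` and `discretize` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_breakpoint(values, labels):
    """Return (T, IG, labels of S1, labels of S2) with maximal IG, or None."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n, base, best = len(pairs), entropy(labels), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no breakpoint between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for _, c in pairs[:i]]    # S1: A <= T
        right = [c for _, c in pairs[i:]]   # S2: A > T
        weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or base - weighted > best[1]:
            best = (t, base - weighted, left, right)
    return best


def mdl_threshold(labels, left, right):
    """Assumed stopping threshold (Fayyad-Irani MDL criterion)."""
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
    return (math.log2(n - 1) + delta) / n


def discretize(values, labels):
    """Collect breakpoints recursively until IG falls below the threshold."""
    if len(set(labels)) < 2:
        return []
    found = best_breakpoint(values, labels)
    if found is None:
        return []
    t, gain, left, right = found
    if gain < mdl_threshold(labels, left, right):
        return []
    lower = [(v, c) for v, c in zip(values, labels) if v <= t]
    upper = [(v, c) for v, c in zip(values, labels) if v > t]
    return discretize(*zip(*lower)) + [t] + discretize(*zip(*upper))
```

For example, `discretize([1, 2, 3, 4, 10, 11, 12, 13], [0, 0, 0, 0, 1, 1, 1, 1])` places a single breakpoint midway between 4 and 10, while an alternating labelling such as `[0, 1, 0, 1]` over four values yields no breakpoint because no split clears the threshold.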

Step 4: Ant colony optimization is used to judge intelligently whether the behavior causing the data anomaly is a cheating behavior. The cheating behaviors include malicious traffic promotion, black-market gang activity, and activity cheating.

While the ant algorithm forms a classification rule, each ant adds condition items to the antecedent of the rule, and which condition item is added is determined by the probability selection function P_ij:

$$P_{ij} = \frac{\eta_{ij}\,\tau_{ij}(t)}{\sum_{i=1}^{a} x_i \sum_{j=1}^{b_i} \eta_{ij}\,\tau_{ij}(t)}$$

where η_ij is the heuristic function value of the condition item term_ij; the larger η_ij is, the more term_ij contributes to the classification system, and therefore the higher the probability that it is selected and added to the rule antecedent. τ_ij(t) is the pheromone of the condition item term_ij during the t-th iteration. a is the number of features. If feature A_i has not yet been used by the current ant, x_i is set to 1; otherwise, x_i is set to 0. b_i is the number of values of the i-th feature. term_ij denotes the condition item A_i = V_ij, where A_i is the i-th feature and V_ij is the j-th value of the i-th feature. The probability selection function P_ij is thus the probability that the condition item term_ij is selected by an ant and added to the antecedent of the classification rule.
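The selection rule described above (heuristic value times pheromone, normalized over the condition items of features the current ant has not yet used) can be sketched as follows. This is a minimal sketch assuming that η and τ are stored in dictionaries keyed by the index pair (i, j) and that an item is drawn by roulette-wheel sampling; `selection_probabilities` and `choose_term` are hypothetical names.

```python
import random


def selection_probabilities(eta, tau, used_features):
    """Map each eligible (i, j) to P_ij.

    eta, tau: dicts keyed by (i, j); used_features: set of feature
    indices i already present in the rule antecedent (x_i = 0 for them).
    """
    weights = {
        term: eta[term] * tau[term]
        for term in eta
        if term[0] not in used_features  # x_i = 1 only for unused features
    }
    total = sum(weights.values())
    if total == 0:
        return {}
    return {term: w / total for term, w in weights.items()}


def choose_term(probabilities, rng=random):
    """Roulette-wheel draw of one condition item according to P_ij."""
    terms = list(probabilities)
    return rng.choices(terms, weights=[probabilities[t] for t in terms], k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is convenient when comparing runs of the rule construction.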

The proportion of a condition item's occurrences in the training set that are associated with cheating websites is used as heuristic information to guide the ants toward the best combination of condition items for constructing a classification rule. The heuristic function η_ij of the condition item term_ij is:

$$\eta_{ij} = \frac{|\mathit{spam\_class},\, T_{ij}|}{|T_{ij}|}$$

where |T_ij| is the frequency with which the condition item term_ij occurs in the entire training set, and |spam_class, T_ij| is the frequency with which it occurs among the cheating websites of the training set. η_ij is directly proportional to the latter, so the ants preferentially add to the rule antecedent the condition items that are most strongly associated with cheating websites. As the number of iterations increases and the pheromone is updated, the ants gradually discover the classification detection rules associated with normal websites.
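The heuristic described above, the share of a condition item's occurrences that fall in the cheating class, can be computed by simple counting. The sketch below assumes that each training sample is a tuple of discrete feature values and that the cheating class carries the label `"spam"`; both the label and the name `heuristic_values` are hypothetical.

```python
from collections import Counter


def heuristic_values(samples, labels, spam_label="spam"):
    """Return eta for every condition item, keyed by (feature index, value)."""
    total = Counter()  # occurrences of the item in the whole training set
    spam = Counter()   # occurrences of the item among cheating samples
    for features, label in zip(samples, labels):
        for i, value in enumerate(features):
            total[(i, value)] += 1
            if label == spam_label:
                spam[(i, value)] += 1
    return {term: spam[term] / count for term, count in total.items()}
```

An item that appears only in cheating samples gets the value 1, and an item that never appears in them gets 0, so the values can be fed directly into the probability selection function.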

Step 5: The result obtained in step 4 is output and then judged comprehensively by hand to determine whether the behavior causing the data anomaly is a violation.

Advantageous technical effects

The method works efficiently. It combines a KNN semi-supervised learning algorithm with an ant colony learning algorithm: it monitors data fluctuations and anomalies using a small amount of data, and then uses the ant colony optimization algorithm to verify and locate the points where anomalies may occur. It is suitable for link-based detection and also accommodates content-based detection, so it applies to a variety of network risk scenarios.

The invention takes into account both the efficiency of artificial intelligence and the value of human experience. With this method, the accuracy and timeliness of the artificial intelligence can be improved step by step through training.

The method can overcome the learning bias that artificial intelligence suffers when sample points are insufficient, and later training further improves the general applicability of the equipment.

Drawings

FIG. 1 is a block flow diagram of the present invention.

FIG. 2 is a flow chart of the undersampling module of FIG. 1.

FIG. 3 is a flow chart of the intelligent analysis module of FIG. 1.

Detailed Description

Technical features of the present invention will now be described in detail with reference to the accompanying drawings.

Referring to FIG. 1, in the multi-channel operation risk control method based on active identification and intelligent monitoring, a computer constructs two modules: an undersampling module and an intelligent analysis module. The undersampling module samples and checks the input data and judges whether a data anomaly exists. When an anomaly is found, the data are passed to the intelligent analysis module for verification, and the intelligent analysis module rechecks the input data.

Referring to FIG. 2, the undersampling module makes its judgement on the distance to the specified nearest-neighbor sample points, while addressing the poor results caused by imbalanced samples and overcoming the strong influence of noise points.

The undersampling module comprises four submodules: a network traffic data acquisition and preprocessing submodule, a supervised-learning-based classification model building submodule, a KNN-based semi-supervised learning label correction submodule, and a model updating submodule.
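The two-stage flow of FIG. 1 amounts to a cheap screening step that gates a more expensive recheck. The following is a minimal sketch of that control flow only; `run_risk_control`, `sample_check` and `intelligent_analysis` are hypothetical names, and the two callables stand in for the undersampling module and the intelligent analysis module.

```python
def run_risk_control(batch, sample_check, intelligent_analysis):
    """Stage 1 screens a batch; stage 2 runs only if stage 1 flags an anomaly.

    sample_check(batch) -> bool: True when the sampled data look abnormal.
    intelligent_analysis(batch): verdict from rechecking the same data.
    """
    if not sample_check(batch):
        return {"anomalous": False, "verdict": None}
    return {"anomalous": True, "verdict": intelligent_analysis(batch)}
```

Keeping the second stage behind the first is what lets the recheck run without a large increase in hardware resources: it is invoked only for the batches that the sampling check flags.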

Furthermore, the network traffic data acquisition and preprocessing submodule is responsible for acquiring the network traffic data. The main traffic features of the network traffic data should include the source IP address, the protocol type, the number of bytes, and so on.

The submodule is also responsible for normalizing the network traffic data:

$$x_{scale} = \frac{x - x_{min}}{x_{max} - x_{min}}$$

where x is the sample attribute value, x_min is the minimum value of the attribute, x_max is the maximum value of the attribute, and x_scale is the normalized data. The normalized data are referred to as "label data".
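The normalization described here, with x_min and x_max as the attribute's extremes, is min-max scaling to the range [0, 1]. A minimal sketch for one attribute column follows; the handling of a constant column (all values equal) is an assumption, since the text does not cover that case, and `min_max_scale` is a hypothetical name.

```python
def min_max_scale(column):
    """Scale one attribute column to [0, 1] using its minimum and maximum."""
    x_min, x_max = min(column), max(column)
    if x_max == x_min:
        # Assumption: a constant attribute carries no information; map it to 0.
        return [0.0 for _ in column]
    return [(x - x_min) / (x_max - x_min) for x in column]
```

For a byte-count column such as `[10, 20, 30]`, the scaled values are `[0.0, 0.5, 1.0]`. Scaling each feature separately keeps large-valued features such as byte counts from dominating the distance computations used later by the KNN step.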

Furthermore, in the supervised-learning-based classification model building submodule, label data are selected as training data, and a suitable classification model is then selected and trained. During model training the data set is split and part of the data is drawn at random for validation; during model selection, cross-validation is used to train an initial classification model, which is the model that classifies the network traffic best. The set of data classified by the "initial classification model" is the "initial classification data set". These steps improve the accuracy of the classification model.

Furthermore, the KNN-based semi-supervised learning label correction submodule is responsible for checking, at low cost and without increasing the amount of human involvement, whether the detected abnormal traffic data have reached a specified threshold. If the threshold has not been reached, monitoring continues; otherwise, the data label correction operation is executed, as follows:

1) A proportion of the "label data" is drawn for manual labelling and then placed back into the data set.

2) Data are first extracted from the data set, the importance of each traffic feature is incorporated into the KNN decision, and the weighted Euclidean distance between samples is calculated. The data set is then classified using the self-training method. Finally, the optimal number of neighbors K is selected by ten-fold cross-validation. The result is the "corrected classification data", and the collection of all "corrected classification data" is the "corrected classification data set".
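The label correction procedure can be sketched as a feature-weighted KNN vote inside a self-training loop. This is a minimal sketch under stated assumptions: the feature weights are supplied by the caller, the vote is a simple majority, one sample is pseudo-labelled per round, and a confidence floor `min_confidence` ends the loop. The selection of K by ten-fold cross-validation is omitted, and all function names are hypothetical.

```python
import math
from collections import Counter


def weighted_distance(a, b, weights):
    """Euclidean distance with one importance weight per traffic feature."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def knn_vote(sample, labeled, weights, k):
    """Return (label, vote share) from the k nearest labeled samples."""
    nearest = sorted(
        labeled, key=lambda item: weighted_distance(sample, item[0], weights))[:k]
    label, count = Counter(lbl for _, lbl in nearest).most_common(1)[0]
    return label, count / len(nearest)


def self_train(labeled, unlabeled, weights, k=3, min_confidence=0.6):
    """Move the most confidently classified unlabeled sample into the
    labeled set, one per round, until none clears min_confidence."""
    labeled, pending = list(labeled), list(unlabeled)
    while pending:
        scored = [(knn_vote(s, labeled, weights, k), s) for s in pending]
        (label, confidence), best = max(scored, key=lambda item: item[0][1])
        if confidence < min_confidence:
            break
        labeled.append((best, label))
        pending.remove(best)
    return labeled, pending
```

Samples left in the second return value are the ones the classifier could not label confidently, which makes them natural candidates for the manual labelling mentioned in item 1).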

Further, the model updating submodule operates according to the following steps:

1) The "initial classification data set" and the "corrected classification data set" are compared. The entries whose classification labels differ are identified, verified manually, and entered into the initial classification data set; the result is called the "manually verified and updated classification data set".

2) A model is trained on the "manually verified and updated classification data set" to obtain the "manually verified and updated classification model".

3) The classification accuracy of the "initial classification model" is compared with that of the "manually verified and updated classification model":

If the accuracy of the former is higher than that of the latter, the proportion of manually labelled data is increased.

Otherwise, the "initial classification model" is abandoned and the "manually verified and updated classification model" is adopted.
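The comparison logic of the model updating submodule can be sketched as two small helpers: one that lists where the two labelled data sets disagree (the entries sent for manual verification), and one that applies the accuracy comparison. The step size by which the manually labelled proportion grows is not given in the text; the value 0.05 below is an assumption, and the function names are hypothetical.

```python
def label_disagreements(initial_labels, corrected_labels):
    """Indices at which the two classification data sets carry different labels."""
    return [i for i, (a, b) in enumerate(zip(initial_labels, corrected_labels))
            if a != b]


def choose_model(initial_accuracy, updated_accuracy, manual_ratio,
                 step=0.05, max_ratio=1.0):
    """Return (model to keep, new proportion of manually labelled data).

    step is an assumed increment; the source does not specify one.
    """
    if initial_accuracy > updated_accuracy:
        # The update did not help: keep the initial model, label more by hand.
        return "initial", min(manual_ratio + step, max_ratio)
    return "updated", manual_ratio
```

Only the disagreeing entries need human attention, which keeps the manual workload proportional to how much the label correction actually changed.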

Referring to FIG. 3, after the intelligent analysis module is activated by the undersampling module, it verifies the data by carrying out, in order, steps 1 to 5 set out above under Disclosure of Invention: data set balancing (step 1), feature selection (step 2), discretization (step 3), ant-colony-optimization-based identification of cheating behavior (step 4), and output of the result for comprehensive manual judgement (step 5). The symbols A, C, H(C), IG(A, T; S), P_ij, η_ij, τ_ij(t), term_ij, |T_ij| and |spam_class, T_ij| have the meanings defined in those steps.

Claims

1. A multi-channel operation risk control method based on active identification and intelligent monitoring, characterized in that: a computer constructs an undersampling module and an intelligent analysis module; the undersampling module is responsible for sampling and checking input data and judging whether a data anomaly exists; when a data anomaly is found, the data are passed to the intelligent analysis module for verification; and the intelligent analysis module is responsible for rechecking the input data.

2. The multi-channel operation risk control method based on active identification and intelligent monitoring as claimed in claim 1, wherein: the undersampling module makes its judgement on the distance to the specified nearest-neighbor sample points, while addressing the poor results caused by imbalanced samples and overcoming the strong influence of noise points.

3. The multi-channel operation risk control method based on active identification and intelligent monitoring as claimed in claim 2, wherein: the network traffic data acquisition and preprocessing submodule is responsible for acquiring network traffic data; the main traffic features of the network traffic data should include the source IP address, the protocol type, the number of bytes, and other features;

the submodule is also responsible for normalizing the network traffic data, x_scale denoting the normalized data; the normalized data are referred to as "label data".

4. The multi-channel operation risk control method based on active identification and intelligent monitoring as claimed in claim 2, wherein: in the supervised-learning-based classification model building submodule, label data are selected as training data, and a suitable classification model is then selected and trained; during model training the data set is split and part of the data is drawn at random for validation, and during model selection cross-validation is used to train the model that serves as the initial classification model.

5. The multi-channel operation risk control method based on active identification and intelligent monitoring as claimed in claim 2, wherein: the KNN-based semi-supervised learning label correction submodule is responsible for checking, at low cost and without increasing the amount of human involvement, whether the detected abnormal traffic data have reached a specified threshold: if the threshold has not been reached, monitoring continues; otherwise, the data label correction operation is executed.

6. The multi-channel operation risk control method based on active identification and intelligent monitoring as claimed in claim 1, wherein: after the intelligent analysis module is activated by the undersampling module, the following steps are carried out in sequence: data set balancing; feature selection; discretization; intelligent identification by ant colony optimization; and output of the result.

7. The multi-channel operation risk control method based on active identification and intelligent monitoring as claimed in claim 6, wherein step 1 specifically comprises: data set balancing: suppose the data set contains k1 samples of category C1 and k2 samples of category C2; each sample in the data set is a d-dimensional vector, and the data set is balanced with the k-means algorithm, which yields k clusters;

the closer a sample lies to a cluster center, the better it represents the characteristics of that cluster; suppose cluster Ci contains ni samples, where i = 1, 2, ..., k;

from each cluster Ci, the ni × k2/k1 samples nearest to the cluster center wi are selected, in proportion to the cluster size; in total, N samples of category C1 are obtained, so that the number of C1 samples is balanced with the number of C2 samples.

8. The multi-channel operation risk control method based on active identification and intelligent monitoring as claimed in claim 6, wherein the feature selection of step 2 specifically comprises: a feature subset is first generated from the original feature set and evaluated with an evaluation function; the evaluation result is then compared with a stopping criterion; if the criterion is met, the feature subset is output and verified, and if not, the next feature subset is generated and evaluated; the evaluation function used in this step is:

$$IG(A) = H(C) - H(C \mid A) = -\sum_{i=1}^{n} P(C_i)\log P(C_i) + \sum_{j=1}^{m} P(A = A_j)\sum_{i=1}^{n} P(C_i \mid A = A_j)\log P(C_i \mid A = A_j)$$

9. The multi-channel operation risk control method based on active identification and intelligent monitoring as claimed in claim 6, wherein step 3 specifically comprises: given a set S, a feature A and a breakpoint T, the breakpoint T divides the value set S of feature A into two subsets S₁ and S₂, where A ≤ T in S₁ and A > T in S₂; the weighted information entropy of feature A, Entropy(A, T; S), gives the information entropy of the set S after it is divided by the breakpoint T:

$$\mathrm{Entropy}(A, T; S) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$$

discretization stops when the value of the information gain

$$IG(A, T; S) = \mathrm{Entropy}(S) - \mathrm{Entropy}(A, T; S)$$

is smaller than the threshold δ; the threshold is computed from N, the number of samples in the set S; k, the number of categories contained in the set S; and k_i, the number of categories contained in the subset S_i.

10. The multi-channel operation risk control method based on active identification and intelligent monitoring as claimed in claim 6, wherein step 4 specifically comprises: ant colony optimization is used to judge intelligently whether the behavior causing the data anomaly is a cheating behavior; the cheating behaviors include malicious traffic promotion, black-market gang activity, and activity cheating;

while the ant algorithm forms a classification rule, each ant adds condition items to the antecedent of the rule, and which condition item is added is determined by the probability selection function P_ij:

$$P_{ij} = \frac{\eta_{ij}\,\tau_{ij}(t)}{\sum_{i=1}^{a} x_i \sum_{j=1}^{b_i} \eta_{ij}\,\tau_{ij}(t)}$$

the probability selection function P_ij is the probability that the condition item term_ij is selected by an ant and added to the antecedent of the classification rule;

the proportion of a condition item's occurrences in the training set that are associated with cheating websites is used as heuristic information to guide the ants toward the best combination of condition items for constructing a classification rule; the heuristic function η_ij of the condition item term_ij is:

$$\eta_{ij} = \frac{|\mathit{spam\_class},\, T_{ij}|}{|T_{ij}|}$$

where |T_ij| is the frequency with which the condition item term_ij occurs in the entire training set, and |spam_class, T_ij| is the frequency with which it occurs among the cheating websites of the training set; η_ij is directly proportional to the latter, so the ants preferentially add to the rule antecedent the condition items that are most strongly associated with cheating websites; as the number of iterations increases and the pheromone is updated, the ants gradually discover the classification detection rules associated with normal websites.