CN116776331A

CN116776331A - Internal threat detection method and device based on user behavior modeling

Info

Publication number: CN116776331A
Application number: CN202310901880.9A
Authority: CN
Inventors: 殷丽华; 李超; 符宇晗; 田志宏; 程朝辉; 方滨兴
Original assignee: Guangzhou University
Current assignee: Guangzhou University
Priority date: 2023-07-20
Filing date: 2023-07-20
Publication date: 2023-09-19

Abstract

The embodiments of this specification provide an internal threat detection method and device based on user behavior modeling. The method comprises: collecting log data through a log collection module, labeling the log data, and sending it to a log analysis module; converting the log data into serialized log keys through the log analysis module, extracting features from the log keys, and inputting the extracted log features to an abnormality detection module; and, in the abnormality detection module, predicting on the currently input log features with a strong classifier model formed by combining a plurality of classifiers, outputting a prediction result, and judging from the prediction result whether the log is abnormal. Efficient internal threat detection is thereby achieved.

Description

Internal threat detection method and device based on user behavior modeling

Technical Field

The present document relates to the technical field of internal threat detection and machine learning, and in particular, to a method and apparatus for detecting internal threat based on user behavior modeling.

Background

Internal threats refer to security issues caused by personnel (e.g., employees and trusted partners) who have access to corporate networks, systems, and data. Although internal threats occur less often, they may cause more severe damage than external intrusions. This is because insiders are generally familiar with the company's computer systems and operational processes, and their accidental or malicious activity may lead to the leakage, damage, or even theft of company data.

Since insiders are very familiar with a company's information systems and processes and hold certain privileges, it is difficult to determine when they are acting maliciously. For example, they may use their privileges to perform illegal operations, steal corporate secrets, or disclose sensitive information. They may also use legitimate access rights to covertly collect sensitive information, or to implant a backdoor program in the corporate network for later attacks or data theft.

Although many system protection techniques have been developed to prevent intrusion by outsiders, for example techniques based on Internet Protocol (IP) connection patterns and attack types, companies still face many challenges in detecting and defending against internal threats. One strategy is a rule-based internal threat detection system, but it has the following shortcomings:

1) Limited coverage. Rules are defined from known internal threat behaviors and attack patterns, so new or unknown behaviors and patterns may go undetected. Rules can also describe only specific aspects of behavior, and it is difficult for them to cover all possible threat behaviors.

2) High false alarm rate. Because rules are statically defined and cannot adapt to dynamic changes in the environment and in behavior, false alarms may occur. For example, an employee may legitimately access sensitive data in a particular situation, yet the rules cause the system to flag that access as an internal threat.

3) Manual definition and maintenance. Professional security personnel are required to write and update the rules. This process takes a lot of time and effort and is prone to omissions and errors, and managing and updating the rules also consumes considerable resources and cost.

4) Difficulty handling complex threat behaviors. Rule-based internal threat detection systems can handle only relatively simple threat behaviors, not more complex behaviors and attack patterns. For example, some internal threats may be carried out jointly by multiple employees, and rules have difficulty describing such behavior patterns.

Another strategy is to build statistical or machine learning models from previous data to predict potentially malicious behavior. While machine learning has many advantages for user anomaly detection, it also has some drawbacks:

1) Data set bias. Machine learning algorithms require a large amount of high-quality data to train the model. If the data set is biased, for example because some types of attack occur very infrequently in it, the algorithm may not identify those attacks accurately. The quality and diversity of the data set are therefore critical to the performance of the algorithm.

2) High false positive rate. Machine learning algorithms may produce false positives, i.e., normal user behavior is wrongly marked as abnormal, which creates additional work for security personnel. Adjusting the algorithm parameters can reduce false positives, but doing so may also increase the rate of false negatives.

In general, the problems of the prior art are as follows. 1) Rule-based detection systems must be constantly updated with the knowledge of domain experts. This consumes manpower and material resources, and the risk that people will circumvent the rules always remains. 2) Malicious behavior categories change at a high frequency: in modern networked systems, computational tasks are becoming more complex and variable, so the log statements that generate the various access log messages are also updated frequently, which makes models of user behavior difficult to generalize. 3) Companies of all sizes are rarely willing to share their internal logs and user behavior data, so the available data is highly imbalanced. From a modeling perspective, training a binary classification algorithm is almost impossible when there are only a few abnormal samples. Under such imbalance, most statistical and machine learning algorithms tend to classify all activities as normal, which makes it difficult for an internal threat detection framework to adapt to imbalanced data.

Disclosure of Invention

The invention aims to provide an internal threat detection method and device based on user behavior modeling, and aims to solve the problems in the prior art.

The invention provides an internal threat detection method based on user behavior modeling, which comprises the following steps:

collecting log data through a log collecting module, marking the log data and sending the log data to a log analyzing module;

converting the log data into serialized log keys through the log analysis module, extracting features of the log keys, and inputting the extracted log features to an abnormality detection module;

and predicting the currently input log features through a strong classifier model formed by combining a plurality of classifiers in the abnormality detection module, outputting a prediction result, and judging whether the log is abnormal or not according to the prediction result.

The invention provides an internal threat detection device based on user behavior modeling, which comprises:

the log acquisition module is used for acquiring log data, labeling the log data and sending the log data to the log analysis module;

the log analysis module is used for converting log data into serialized log keys, extracting characteristics of the log keys and inputting the extracted log characteristics to the abnormality detection module;

and the abnormality detection module is used for predicting the current input log characteristics through a strong classifier model formed by combining a plurality of classifiers, outputting a prediction result and judging whether the log is abnormal according to the prediction result.

By adopting the embodiments of the invention, the various behavior records of users are continuously collected to produce large-scale heterogeneous log data, and effective vectorized features are extracted from that heterogeneous log data. Machine learning is then used to learn from and predict on the log data vectors, achieving efficient internal threat detection.

Drawings

In order to describe one or more embodiments of this specification, or the solutions of the prior art, more clearly, the drawings needed for that description are briefly introduced below. The drawings described below are evidently only some of the embodiments described in this specification, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of an internal threat detection method based on user behavior modeling in accordance with an embodiment of the invention;

FIG. 2 is a schematic diagram of a framework of an anomaly detection method based on user behavior modeling in accordance with an embodiment of the present invention;

FIG. 3 is a detailed process flow diagram of a user behavior modeling based internal threat detection method in accordance with an embodiment of the invention;

FIG. 4 is a flow chart of an anomaly detection algorithm of an internal threat detection method based on user behavior modeling in accordance with an embodiment of the invention;

FIG. 5 is a schematic diagram of a specific example of an embodiment of the present invention;

FIG. 6 is a schematic diagram of an internal threat detection apparatus based on user behavior modeling in accordance with an embodiment of the invention.

Detailed Description

In order that a person skilled in the art may better understand the technical solutions in one or more embodiments of this specification, those technical solutions are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of this specification. All other embodiments obtained by a person of ordinary skill in the art from one or more embodiments of this specification without inventive effort fall within the scope of protection of this disclosure.

Method embodiment

According to an embodiment of the present invention, there is provided an internal threat detection method based on user behavior modeling, and fig. 1 is a schematic diagram of an internal threat detection method based on user behavior modeling according to an embodiment of the present invention, as shown in fig. 1, where the internal threat detection method based on user behavior modeling according to an embodiment of the present invention specifically includes:

step 101, collecting log data through a log collecting module, marking the log data and sending the log data to a log analyzing module; the log data acquisition by the log acquisition module specifically comprises the following steps:

the log data is collected by a log collection module through a syslog log collection protocol, the log data is stored in a MySQL database in a partition mode according to time or other dimensions, and a table is built for each type of record.
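The collection and storage step can be sketched as follows. This is only an illustrative sketch under stated assumptions: the line layout in `SYSLOG_RE`, the sample log lines, and the `log_<type>` table naming are hypothetical, and Python's built-in `sqlite3` stands in for MySQL so that the example runs on its own. SQLite has no native partitions, so the time partition is emulated here with an indexed `day` column, whereas a MySQL deployment would normally use its own table partitioning.

```python
import re
import sqlite3
from datetime import datetime

# Hypothetical RFC 3164-style layout: "<PRI>Mmm dd hh:mm:ss host tag: message".
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<tag>[^:\s]+): (?P<msg>.*)$"
)

def parse_syslog(line, year):
    """Split one syslog line into fields; return None if it does not match."""
    m = SYSLOG_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return {
        "ts": ts.isoformat(sep=" "),
        "day": ts.strftime("%Y%m%d"),  # partition key: one partition per day
        "host": m["host"],
        "record_type": m["tag"],       # e.g. logon, http, file
        "msg": m["msg"],
    }

def store(conn, rec):
    """Create one table per record type on demand and insert the record."""
    table = "log_" + re.sub(r"\W", "_", rec["record_type"])
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(day TEXT, ts TEXT, host TEXT, msg TEXT)"
    )
    # Index on the partition key so that per-day queries stay cheap.
    conn.execute(f"CREATE INDEX IF NOT EXISTS idx_{table}_day ON {table}(day)")
    conn.execute(
        f"INSERT INTO {table} VALUES (?, ?, ?, ?)",
        (rec["day"], rec["ts"], rec["host"], rec["msg"]),
    )
    return table

conn = sqlite3.connect(":memory:")
lines = [
    "<38>Jul 20 09:15:02 pc-1042 logon: user=A action=Logon",
    "<38>Jul 20 09:17:40 pc-1042 http: user=A url=http://example.com/a",
    "<38>Jul 21 18:03:11 pc-1042 logon: user=A action=Logoff",
]
tables = [store(conn, parse_syslog(line, 2023)) for line in lines]

# Query a single day of logon records through the partition key.
rows = conn.execute(
    "SELECT ts, msg FROM log_logon WHERE day = ?", ("20230720",)
).fetchall()
```

Keeping one table per record type and filtering on the day key mirrors the "table for each type of record" and "partition by time" ideas in the text; the year is passed in separately because this line layout does not carry one.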

Step 102, converting the log data into serialized log keys through a log analysis module, extracting features of the log keys, and inputting the extracted log features to an abnormality detection module;

the method for extracting the characteristics of the log keys comprises the following steps of:

matching keywords and patterns in the log data by using regular expressions; encoding the log data parsed by the log parsing algorithm as a bag of words, so that non-numeric log data is digitized; that is, representing the text information as a bag of words in which each word is taken as a feature, and either counting the number of occurrences of each word in the log data or representing each word as a binary variable indicating its presence or absence, to obtain the log features.
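A minimal sketch of the regular-expression matching might look like the following. The patent does not define the exact form of a log key, so the sketch assumes that a log key is the message with its variable fields (IP addresses, timestamps, numbers) masked by placeholders; the patterns, the keyword list and the sample message are all hypothetical.

```python
import re

# Hypothetical patterns; a real deployment would tune these to its own logs.
PATTERNS = [
    ("<IP>", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("<TIME>", re.compile(r"\b\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\b")),
    ("<NUM>", re.compile(r"\b\d+\b")),
]
KEYWORDS = re.compile(r"\b(?:failed|denied|error|logon|logoff)\b", re.IGNORECASE)

def to_log_key(message):
    """Mask variable fields so that similar messages share one log key."""
    key = message
    for placeholder, pattern in PATTERNS:
        key = pattern.sub(placeholder, key)
    return key

def keywords(message):
    """Return the lower-cased keywords found in the message, in order."""
    return [w.lower() for w in KEYWORDS.findall(message)]

msg = "2023-07-20 09:15:02 Logon failed for user A from 10.0.3.17 port 52144"
log_key = to_log_key(msg)
found = keywords(msg)
```

The more specific patterns (IP address, timestamp) are applied before the generic number pattern so that their digits are not masked piecemeal.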

The log data is converted into serialized log keys through a log analysis module, and the feature extraction of the log keys specifically comprises the following steps:

matching keywords in the log through a log analysis module by using a regular expression, and extracting log data of a section as dimension characteristics, wherein the dimension characteristics specifically comprise: a log key and a label;

encoding the log key and the label: first, each text is segmented into words and each word is extracted; for each text, the number of occurrences of every word in that text is recorded to form a vector; across all texts, all words that occur are collected into a vocabulary; the vector representation of each text is arranged in vocabulary order to obtain a high-dimensional feature vector; and normalization and/or dimension reduction is applied to the feature vectors to obtain the log features.
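The encoding described above can be illustrated with a small bag-of-words sketch. The tokenizer, the sample texts and the choice of L2 normalization are assumptions made for illustration; the text only requires word counts (or presence/absence flags) arranged in vocabulary order, followed by some normalization and/or dimension reduction.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def build_vocabulary(texts):
    """Collect every word that occurs in any text and index it in sorted order."""
    words = sorted({w for t in texts for w in tokenize(t)})
    return {w: i for i, w in enumerate(words)}

def count_vector(text, vocab):
    """Count the occurrences of each vocabulary word in one text."""
    counts = Counter(tokenize(text))
    return [counts.get(w, 0) for w in vocab]  # dicts preserve insertion order

def binary_vector(text, vocab):
    """Presence/absence variant of the count vector."""
    return [1 if c > 0 else 0 for c in count_vector(text, vocab)]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors stay zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)

texts = ["logon pc1 logon", "file copy usb", "logon usb"]
vocab = build_vocabulary(texts)
vectors = [count_vector(t, vocab) for t in texts]
features = [l2_normalize(v) for v in vectors]
```

With real logs the vocabulary grows quickly, which is why the text also allows a dimension reduction step after the vectors are built.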

And step 103, predicting the currently input log features through a strong classifier model formed by combining a plurality of classifiers in the abnormality detection module, outputting a prediction result, and judging whether the log is abnormal according to the prediction result.

The technical scheme of the embodiment of the invention further comprises the following steps:

training the strong classifier model formed by combining a plurality of classifiers in the abnormality detection module: acquiring log features and training with an LSTM (long short-term memory) model, a decision tree, a PCA (principal component analysis) model and a DNN (deep neural network) model; using the Stacking ensemble learning method, inputting the prediction results of the plurality of classifiers as new features into another model, the final model determining the weights of the plurality of classifiers through cross-validation on the training set; and, in the test stage, having each model predict on new data, the final result being a weighted sum of the models' prediction results;

judging whether the final result meets the requirement of service safety, if so, determining to complete training of the model, and if not, retraining.
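The weighted combination and the acceptance check can be sketched as below. The text does not say how the final model derives the weights, so this sketch assumes, purely for illustration, that each weight is proportional to the model's cross-validation score; the scores, the 0.5 decision threshold and the required metric values are hypothetical numbers, not results from the patent.

```python
def weights_from_cv(cv_scores):
    """Turn per-model cross-validation scores into weights that sum to 1."""
    total = sum(cv_scores.values())
    return {name: s / total for name, s in cv_scores.items()}

def weighted_score(predictions, weights):
    """Final result: weighted sum of the base models' anomaly scores."""
    return sum(weights[name] * p for name, p in predictions.items())

def meets_requirement(metrics, required):
    """True when every required metric reaches its minimum value."""
    return all(metrics[k] >= v for k, v in required.items())

# Hypothetical cross-validation scores for the four base models.
cv_scores = {"lstm": 0.90, "tree": 0.80, "pca": 0.60, "dnn": 0.70}
weights = weights_from_cv(cv_scores)

# Hypothetical anomaly scores in [0, 1] produced for one log sample.
preds = {"lstm": 0.9, "tree": 1.0, "pca": 0.2, "dnn": 0.7}
score = weighted_score(preds, weights)
is_abnormal = score >= 0.5  # assumed decision threshold

# Retrain when the evaluation does not meet the (assumed) business requirement.
ok = meets_requirement(
    {"accuracy": 0.93, "recall": 0.71},
    {"accuracy": 0.90, "recall": 0.80},
)
needs_retraining = not ok
```

In this example the sample is scored as abnormal, but the model as a whole would still be sent back for retraining because its recall falls short of the assumed requirement.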

The above technical solutions of the embodiments of the present invention are described in detail below with reference to the accompanying drawings.

The invention continuously collects the various behavior records of users and generates large-scale heterogeneous log data in order to perform effective internal threat detection. This approach can help security professionals analyze users' behavior patterns and discover possible internal threats early. The framework of the anomaly detection method based on user behavior modeling is shown in fig. 2: a log acquisition module collects user logs, a log analysis module converts the non-serialized logs into serialized data, and model training is then carried out. The algorithm flow integrates a plurality of machine learning models and finally trains an ensemble model capable of prediction, namely the abnormality detection module. 1) The log acquisition module is responsible for collecting logs, storing them in the database and sending them to the log analysis module. 2) The log analysis module converts the log file into serialized log keys and completes feature extraction. 3) The abnormality detection module is a strong classifier model formed by combining a plurality of classifiers; it is responsible for outputting a prediction for the currently input log features and judging whether the log is abnormal.

1) Log acquisition module. The main functions of the log acquisition module are as follows:

1.1) Syslog log collection protocol: the syslog log collection protocol is used for transmitting system log data. It supports a variety of transport protocols, such as UDP, TCP and TLS.

1.2) MySQL database: partitioned storage is used, that is, the log data is stored in partitions according to time or other dimensions. This allows faster and more accurate querying and filtering, and also optimizes the use of storage space.

2) Log analysis module. The main functions of the log analysis module are as follows:

2.1) Log parsing algorithm: keywords and patterns in the log are matched using regular expressions. For example, regular expressions may be used to search for specific IP addresses, date and time stamps, error messages, and so on.

2.2) Feature extraction algorithm: the log information parsed by the log parsing algorithm is encoded as a bag of words, so that non-numeric log information is digitized. The text information is represented as a bag of words, i.e. each word is treated as a feature, and the number of occurrences of each word in the log text is counted, or each word is represented as a binary variable indicating presence or absence.

3) Abnormality detection module. The main functions of the abnormality detection module are as follows:

3.1) Training phase:

Log feature data is received, and ensemble model training is performed using LSTM, decision tree, PCA and DNN. Using the Stacking ensemble learning method, the prediction results of the multiple classifiers are input as new features into another model, and the final model determines the weights of the multiple classifiers through cross-validation on the training set. In the test phase, each model predicts on new data, and the final result is a weighted sum of the model predictions.

3.2) Prediction phase:

The log information of a given user, sent by the log analysis module, is received and scored to judge whether that user's access is normal.

The flow of the internal threat detection method based on user behavior modeling in the embodiment of the invention is shown in fig. 3, and the specific steps are as follows:

S301, the syslog log acquisition system collects the access records of system users over a period of time, including each user's login devices, HTTP activity, files, operations and the like, stores the access records in a MySQL database, and builds a table for each type of record; the records include some abnormal records.

S302, the data acquired in S301 is taken as the sample data set, labeled, and sent to the log analysis module.

S303, the log analysis module uses regular expressions to match keywords in the log and extracts a section of log information as dimension features, which consist of a log key and a label. The format is shown in table 1.

S304, the log key and the label are encoded. The text is first segmented and each word is extracted. For each text, the number of occurrences of every word in that text is recorded to form a vector. Across all texts, all words that occur are collected into a vocabulary. The vector representation of each text is arranged in vocabulary order to obtain a high-dimensional feature vector. The feature vectors are then normalized, reduced in dimension, or otherwise processed to improve the performance and efficiency of the model. The encoded format is shown in table 2.

S305, the encoded data is sent to the ensemble model for learning. After training is finished, the training accuracy, recall rate, ROC curve and the like are provided so that it can be judged whether the trained network is usable and whether the training results meet expectations.

S306, if the requirement of service safety is met, the network can be used for abnormality detection. If the requirement is not met, the model can be retrained, that is, S305 is performed again with more training iterations or an adjusted data preprocessing mode.

S307, the anomaly detection algorithm produces a scoring judgment for the access log of user A and judges whether the access log is normal or exhibits a certain type of abnormality.
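The evaluation figures mentioned in S305 (accuracy, recall, ROC) can be computed as in the following sketch. The labels and scores are made-up illustration data, the 0.5 threshold is an assumption, and the area under the ROC curve is computed with the pairwise ranking definition rather than by tracing the curve.

```python
def confusion(y_true, y_pred):
    """Count true/false positives and negatives (1 = abnormal, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    """Fraction of samples classified correctly."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of abnormal samples that were detected."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0

def roc_auc(y_true, scores):
    """Probability that an abnormal sample outranks a normal one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up, imbalanced example: six normal samples and two abnormal ones.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.1, 0.3, 0.6, 0.2, 0.8, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
auc = roc_auc(y_true, scores)

# A model that labels everything as normal reaches the same accuracy.
all_normal = [0] * len(y_true)
acc_trivial = accuracy(y_true, all_normal)
rec_trivial = recall(y_true, all_normal)
```

On imbalanced data the trivial "all normal" predictor matches the accuracy of the thresholded model while detecting nothing, which is one reason to check recall and the ROC curve rather than accuracy alone.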

The flow of an anomaly detection algorithm of the internal threat detection method based on user behavior modeling in the embodiment of the invention is shown in fig. 4, and the specific steps are as follows:

S401, after the log acquisition module and the log analysis module complete collection and parsing, the parsed log features are sent to the ensemble model, which determines whether the access is safe; the ensemble model is integrated by Stacking.

S402, the data first passes through the first-layer LSTM of the ensemble model. For the LSTM, the training set D is divided into k parts; for each part, the model is trained on the remaining data and then predicts the results for that part. This is repeated until every part has been predicted, which yields the training set of the secondary model. The k trained models also give k sets of predictions on the test set, which are averaged to obtain the test set of the secondary model.

S403, the above procedure is repeated for the decision tree, PCA and DNN that follow, yielding M-dimensional data.

S404, finally, the secondary model trained on this data makes the prediction, and its output is the final result required.
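The k-fold procedure of S402 to S404 can be sketched in plain Python as follows. To keep the example self-contained, two simple threshold "stumps" stand in for the LSTM, decision tree, PCA and DNN base models, and another stump plays the secondary model; these stand-ins, the toy data and the helper names are assumptions for illustration only.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def out_of_fold(fit, X, y, X_test, k):
    """Return out-of-fold training predictions and fold-averaged test predictions."""
    oof = [0.0] * len(X)
    test_avg = [0.0] * len(X_test)
    for fold in k_fold_indices(len(X), k):
        held = set(fold)
        train = [i for i in range(len(X)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:                      # predict only the held-out part
            oof[i] = predict(X[i])
        for j, x in enumerate(X_test):      # average the k test predictions
            test_avg[j] += predict(x) / k
    return oof, test_avg

def fit_stump(feature):
    """Stand-in base model: threshold halfway between the two class means."""
    def fit(X, y):
        pos = [x[feature] for x, t in zip(X, y) if t == 1]
        neg = [x[feature] for x, t in zip(X, y) if t == 0]
        thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return lambda x: 1.0 if x[feature] > thr else 0.0
    return fit

# Toy data: label 1 = abnormal, 0 = normal.
X = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9], [0.15, 0.25], [0.85, 0.7]]
y = [0, 1, 0, 1, 0, 1]
X_test = [[0.05, 0.1], [0.95, 0.9]]

base_models = {"stump_f0": fit_stump(0), "stump_f1": fit_stump(1)}  # M = 2
columns = [out_of_fold(fit, X, y, X_test, k=3) for fit in base_models.values()]

# One column per base model gives M-dimensional inputs for the secondary model.
meta_train = [list(row) for row in zip(*(oof for oof, _ in columns))]
meta_test = [list(row) for row in zip(*(test for _, test in columns))]

secondary = fit_stump(0)(meta_train, y)     # stand-in secondary model
final_pred = [secondary(row) for row in meta_test]
```

Because every training row is predicted by a model that never saw it, the secondary model is fitted on predictions that resemble what the base models will produce on unseen data.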

By adopting the above technical scheme, the embodiment of the invention can achieve the following:

reducing the time/labor costs of internal threat detection. Rule-based detection systems require that rules be continually updated to remain valid because new threats and vulnerabilities are continually present as technology and environments are continually changing. To address these new threats, the rules need to be continually updated to maintain the accuracy and validity of the detection. Updating rules requires a lot of manpower and material resources, as a professional domain expert is required to design, adjust and test the rules. These experts take a significant amount of time to collect and analyze the relevant data and information to determine which rules need to be updated and improved. In addition, they need to perform extensive testing and verification to ensure that updated rules accurately detect new threats. Furthermore, rule-based detection systems also require constant monitoring and maintenance to ensure the accuracy and validity of rules. This requires extensive effort and resources to monitor and analyze the data and to update rules in time to deal with new threats. In summary, rule-based detection systems require constantly updating and maintaining rules, which require a lot of manpower and material resources, and may be vulnerable or incomplete, so the risk of circumventing rules is always present.

improving the accuracy of the malicious behavior prediction model. The invention models the historical behavior of the user, i.e. it collects and analyzes user logs. Anomaly detection based on user modeling can effectively detect unknown or anomalous behavior because it models and detects on the basis of the user's historical behavior patterns and can adapt to changes and evolution in the user's behavior. The system also has a degree of self-adaptive capacity: it can automatically learn and update the user's normal behavior patterns to adapt to a continuously changing environment and threats. Because a machine learning method can continually learn from data and update its algorithms, it can detect more stably and accurately than rule-based detection.

The malicious behavior prediction model may be adapted to the unbalanced data set. The invention uses ensemble learning, which is a method of combining multiple classifiers into one strong classifier. Ensemble learning may be used to solve the problem of unbalanced data sets. For example, a Bagging or Boosting-based ensemble learning method may be used to train multiple classifiers separately and combine them into one strong classifier. The method can improve the prediction accuracy of the model and solve the problem of unbalanced data sets.
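As one possible illustration of the Bagging option mentioned above, the sketch below trains several simple classifiers on bootstrap samples and combines them by majority vote. Drawing the same number of rows from each class for every bag is an assumption added here to show one way of coping with imbalance; the text itself does not prescribe it. The threshold classifier and the one-dimensional data are likewise hypothetical stand-ins.

```python
import random

def fit_stump(X, y):
    """Stand-in classifier: threshold halfway between the two class means."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > thr else 0

def balanced_bag(X, y, rng):
    """Bootstrap the same number of rows from each class (assumed strategy)."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    n = len(pos)  # size of the minority (abnormal) class in this toy data
    bag_x = rng.choices(pos, k=n) + rng.choices(neg, k=n)
    bag_y = [1] * n + [0] * n
    return bag_x, bag_y

def bagging(X, y, n_models, seed=0):
    """Train n_models classifiers on balanced bags; predict by majority vote."""
    rng = random.Random(seed)
    models = [fit_stump(*balanced_bag(X, y, rng)) for _ in range(n_models)]
    def predict(x):
        votes = sum(m(x) for m in models)
        return 1 if 2 * votes > len(models) else 0
    return predict

# Toy imbalanced data: eight normal samples, two abnormal ones.
X = [0.10, 0.15, 0.20, 0.25, 0.30, 0.12, 0.22, 0.28, 0.80, 0.90]
y = [0] * 8 + [1] * 2

strong = bagging(X, y, n_models=5)
pred_abnormal = strong(0.85)
pred_normal = strong(0.20)
```

Each weak classifier sees the abnormal class as often as the normal one, so no single bag is dominated by normal samples, and the vote combines the weak classifiers into one stronger classifier.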

The following is an illustration.

The following detailed description of the invention is further detailed in connection with the following examples, which are provided to illustrate the invention and not to limit the scope of the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention, and all such equivalent technical solutions are therefore within the scope of the present invention. As shown in fig. 5, the implementation of the present invention is as follows.

1. User A generates a log: during daily business operation, the log acquisition module collects the activity data logs of user A, whose ID is { R3I7-S4TX96FG-8219 JFWFF }, including the user's ID, date, login device and the like. The behavior information is extracted into 60 key features and stored in a standardized data table as the user's activity summary, to facilitate feature extraction from the data. The data format is shown in table 1.

Table 1 log data integrated into standardized data tables after acquisition

2. Parsing the log: the log parsing module encodes and normalizes the data integrated into the standardized data table, converting character strings into a form that the model can handle. The encoded and normalized data format is shown in table 2.

Table 2 encoded and normalized log data

2.000	2.166	1.900	1.000	4.000	5.000	2.900	...
1.000	1.300	2.200	0.000	0.000	4.100	6.00	...

3. Uploading to the abnormality detection module: the encoded and normalized log data is sent to the anomaly detection system and processed by the algorithm.

4. Classification by the ensemble model: the multi-layer ensemble model receives the data and outputs a prediction result after the multi-layer model has made its predictions.

5. Obtaining the prediction result: the final prediction result of the model is obtained, in which the access log of user A is classified as abnormal behavior X.

Compared with the use of rules and manual strategies, the embodiment of the invention reduces labor cost through automatic log mining and modeling. The accuracy of the malicious behavior prediction model is improved: because the ensemble learning method can continually learn from data and update its algorithms, it can detect more stably and accurately than rule-based detection. The malicious behavior prediction model can also adapt to imbalanced data sets. The method of the invention learns the normal behavior of the user, and if the predicted normal behavior does not match the actual behavior, the actual behavior is judged abnormal. Training uses all of the data, most of which is normal data. The ensemble learning used in the invention combines a plurality of classifiers into one strong classifier, which can improve the prediction accuracy of the model and address the problem of imbalanced data sets.

Device embodiment

According to an embodiment of the present invention, there is provided an internal threat detection apparatus based on user behavior modeling, and fig. 6 is a schematic diagram of the internal threat detection apparatus based on user behavior modeling according to the embodiment of the present invention, as shown in fig. 6, where the internal threat detection apparatus based on user behavior modeling according to the embodiment of the present invention specifically includes:

the log collection module 60 is configured to collect log data, label the log data, and send the log data to the log analysis module;

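the log parsing module 62 is configured to convert the log data into serialized log keys, perform feature extraction on the log keys, and input the extracted log features to the anomaly detection module;

The log collection module 60 is specifically configured to:

collect the log data through the syslog log collection protocol, store the log data in a MySQL database in partitions according to time or other dimensions, and build a table for each type of record;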

The log parsing module 62 is specifically configured to:

matching keywords in the log by using a regular expression, and extracting log data of a section as dimension characteristics, wherein the dimension characteristics specifically comprise: a log key and a label;

The anomaly detection module 64 is configured to predict a currently input log feature by using a strong classifier model formed by combining a plurality of classifiers, output a prediction result, and determine whether the log is anomalous according to the prediction result.

The anomaly detection module 64 is further configured to:

training the strong classifier model formed by combining a plurality of classifiers: acquiring log features and training with an LSTM (long short-term memory) model, a decision tree, a PCA (principal component analysis) model and a DNN (deep neural network) model; using the Stacking ensemble learning method, inputting the prediction results of the plurality of classifiers as new features into another model, the final model determining the weights of the plurality of classifiers through cross-validation on the training set; and, in the test stage, having each model predict on new data, the final result being a weighted sum of the models' prediction results.

The embodiment of the present invention is an embodiment of a device corresponding to the embodiment of the method, and specific operations of each module may be understood by referring to descriptions of the embodiment of the method, which are not repeated herein.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. An internal threat detection method based on user behavior modeling, comprising:

2. The method according to claim 1, wherein the collecting of the log data by the log collecting module specifically comprises:

3. The method according to claim 1, wherein the log data is converted into serialized log keys by a log parsing module, and the feature extraction of the log keys specifically comprises:

4. A method according to claim 3, wherein the log data is converted into serialized log keys by a log parsing module, and the feature extraction of the log keys specifically comprises:

5. The method according to claim 1, wherein the method further comprises:

6. An internal threat detection apparatus based on user behavior modeling, comprising:

7. The apparatus of claim 6, wherein the log collection module is specifically configured to:

8. The apparatus of claim 6, wherein the log parsing module is specifically configured to:

9. The apparatus of claim 8, wherein the log parsing module is specifically configured to:

10. The apparatus of claim 6, wherein the anomaly detection module is further to: