CN115859277B - Host intrusion detection method based on system call sequence - Google Patents


Info

Publication number
CN115859277B
Authority
CN
China
Prior art keywords
system call
sequence
feature
leaf node
behavior tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310072261.3A
Other languages
Chinese (zh)
Other versions
CN115859277A (en)
Inventor
李涛
唐聪
何俊江
兰小龙
方文波
陈姿妤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202310072261.3A
Publication of CN115859277A
Application granted
Publication of CN115859277B
Legal status: Active
Anticipated expiration

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a host intrusion detection method based on a system call sequence, which relates to the technical field of computer security and comprises the following steps: S1: capturing system call information and dividing it into a plurality of system call sequences; S2: defining the abnormal activity track represented by an abnormal sequence; S3: storing the mapping relations among features of different granularities; S4: converting the relation mapping graph into an abstract behavior tree; S5: pruning the abstract behavior tree; S6: converting the captured system call sequence into a leaf node sequence and extracting features from the new leaf node sequence; S7: performing feature dimension reduction on the extracted feature vectors; S8: taking the dimension-reduced feature vectors as the input of a machine learning model and classifying the corresponding leaf node sequences as abnormal or normal. The method addresses the excessively high vector dimension and excessive time consumption of feature extraction in the prior art, and can reduce the hardware cost required for host deployment.

Description

Host intrusion detection method based on system call sequence
Technical Field
The invention relates to the technical field of computer security, in particular to a host intrusion detection method based on a system call sequence.
Background
Existing intrusion detection systems (IDS) can be categorized into network-based intrusion detection systems (NIDS) and host-based intrusion detection systems (HIDS). NIDS is typically deployed at backbone network nodes and identifies network intrusion events by inspecting network traffic, whereas HIDS is deployed on hosts and monitors host data such as logs, directories, files, and registries to detect and prevent malicious activity. Unlike NIDS, HIDS can detect internal attacks and advanced persistent threats (APT), and can be considered the last line of defense for securing network assets.
Among host intrusion detection systems, those based on machine learning or deep learning currently perform best and are the most widely applied. To construct good features for detection, system call information is used because it is the most primitive and finest-grained information an operating system provides, so the system call sequence has become the feature most widely and frequently used by HIDS to build an intrusion detection engine.
Feature extraction is a critical task for intrusion detection systems, but because the operation itself is very time consuming, some attacks may be carried out before the feature selection/extraction task is completed. At present, the typical feature extraction methods for building a system-call-based intrusion detection engine are the N-gram sliding window, TF-IDF (Term Frequency-Inverse Document Frequency), and the window frequency method (which combines N-gram with TF-IDF). N-gram scans the whole system call sequence and extracts every run of N consecutive system calls, which retains the order information of system call execution but does not consider how important each extracted feature is for distinguishing intrusions. In contrast, TF-IDF can distinguish the importance of different features but cannot preserve the order information of the system calls. The window frequency method combines the advantages of N-gram and TF-IDF and compensates for the shortcomings of each.
The flow of the window frequency method is shown in FIG. 2, and the specific steps are as follows:
A. System call information is captured from the system log and divided into system call sequences S1, S2, ..., Si of different lengths for subsequent data processing.
B. The system call sequences S1, S2, ..., Si are labeled as normal or abnormal, which facilitates the construction of the subsequent machine learning intrusion detection engine.
C. Features are extracted from the system call sequences using the window frequency method, converting each system call sequence into a feature vector suitable as input to the machine learning intrusion detection engine. N-gram is first used to take feature segments of fixed length N from the system call sequence (for example, with a fixed length of 3, the feature segments of S1 are [4 168 42, 168 42 102, ..., 168 168 4, 168 4 240]), and TF-IDF is then used to assign a weight to each segment (for example, the weight of '4 168 42' is '0.01045553'). In this way, different system call sequences can be converted into vector representations suitable for the intrusion detection engine.
D. The vector representations of the system call sequences and the corresponding classification labels are fed to a machine learning model for training; training on a large amount of system call sequence data yields a machine learning engine for intrusion detection.
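Steps A to C can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular engine: the two toy sequences are hypothetical, and the weighting is the textbook per-sequence TF-IDF variant (`tf * log(N / df)`), which is an assumption because the exact weighting of the prior-art method is not given here.

```python
from collections import Counter
import math


def ngrams(seq, n):
    """Slide a window of length n over seq and return the fragments as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def window_frequency_vectors(sequences, n):
    """Turn each sequence into a TF-IDF weighted vector over its n-gram fragments."""
    docs = [Counter(ngrams(s, n)) for s in sequences]
    vocab = sorted({g for d in docs for g in d})
    n_docs = len(docs)
    # Document frequency: in how many sequences each fragment occurs.
    df = {g: sum(1 for d in docs if g in d) for g in vocab}
    vectors = []
    for d in docs:
        total = sum(d.values())
        vectors.append([
            (d[g] / total) * math.log(n_docs / df[g]) if total else 0.0
            for g in vocab
        ])
    return vocab, vectors


# Hypothetical toy sequences of system call numbers (not taken from any data set).
S1 = [4, 168, 42, 102, 168, 4]
S2 = [4, 168, 42, 42, 5]
vocab, vectors = window_frequency_vectors([S1, S2], n=3)
```

Each distinct fragment becomes one vector dimension, so running this for several values of `n` and concatenating the results is what drives the dimension up.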
However, the existing window frequency method extracts features directly from the original system call sequence. To meet the detection engine's need for feature segments of different lengths, several different fixed lengths must be set for capturing feature segments, which causes the number of feature segments to increase exponentially. As a result, the dimension of the extracted feature vectors becomes too high and feature extraction takes too long, and an intrusion detection engine built with the window frequency method consumes a large amount of storage and computing resources.
Disclosure of Invention
The invention aims to solve the problems of excessively high vector dimension and excessive time consumption in the feature extraction process of the prior art, and to reduce the hardware cost required for host deployment, by providing a host intrusion detection method and device based on a system call sequence.
In a first aspect, the present invention provides an intrusion detection method based on system calls, comprising a system call feature extraction stage and a leaf node sequence detection stage, wherein
The system call feature extraction stage includes:
S1, capturing system call information, dividing the captured system call information into a plurality of system call sequences, and marking the corresponding sequence labels;
S2, defining the abnormal activity track represented by an abnormal sequence through feature characterization modes of different granularities;
S3, storing the mapping relations between the features of different granularities using a relation mapping graph;
S4, converting the relation mapping graph into an abstract behavior tree;
S5, pruning the abstract behavior tree, and saving the pruned abstract behavior tree structure;
The leaf node sequence detection stage includes:
S6, performing leaf node mapping through the abstract behavior tree, converting the captured system call sequence into a leaf node sequence, and extracting features from the new leaf node sequence using the window frequency method;
S7, performing feature dimension reduction on the extracted feature vectors;
S8, taking the dimension-reduced feature vectors as the input of a machine learning model, and classifying the corresponding leaf node sequences as abnormal or normal.
Optionally, in step S2, the granularity feature characterization modes include:
an original system call sequence feature representation, a system behavior feature representation, and a system kernel module feature representation, whose feature granularities are fine-grained, low-level coarse-grained, and high-level coarse-grained, respectively.
Optionally, in step S3, storing the mapping relations between the features of different granularities using a relation mapping graph includes:
the mapping relation between original system calls and system behaviors is many-to-one, the mapping relation between system behaviors and system kernel modules is many-to-one, and the mapping relations between the features of different granularities are stored in the relation mapping graph.
Optionally, in step S4, converting the relation mapping graph into an abstract behavior tree includes:
converting the graph storage form of the relation mapping graph into a tree storage form, so that the mapping relations are stored in an abstract tree structure.
Optionally, in step S5, pruning the abstract behavior tree and saving the pruned abstract behavior tree structure includes:
cutting off different leaf nodes in each round and measuring the effect of pruning by the accuracy of the model; when the accuracy reaches a preset threshold, the current abstract behavior tree is considered to meet the feature extraction requirement, and its structure is saved.
Optionally, in step S7, performing feature dimension reduction on the extracted feature vectors includes:
performing feature dimension reduction on the extracted feature vectors by singular value decomposition.
Optionally, in step S8, taking the dimension-reduced feature vectors as the input of a machine learning model and classifying the corresponding leaf node sequences as abnormal or normal includes:
dividing the dimension-reduced feature vectors and classification labels into a training set, a test set, and a validation set, wherein the training set is used to train the model and determine its parameters, the test set is used to determine the network structure and adjust the hyperparameters of the model, and the validation set is used to check the generalization ability of the model; selecting different machine learning algorithm models and their parameters; and evaluating the model effect by cross-validation.
In a second aspect, the present invention provides an intrusion detection device based on system calls, including a system call feature extraction unit and a leaf node sequence detection unit, wherein
The system call feature extraction unit includes:
a capturing unit, configured to capture system call information, divide the captured system call information into a plurality of system call sequences, and mark the corresponding sequence labels;
a granularity unit, configured to define the abnormal activity track represented by an abnormal sequence through feature characterization modes of different granularities;
a mapping unit, configured to store the mapping relations between the features of different granularities using a relation mapping graph;
a tree conversion unit, configured to convert the relation mapping graph into an abstract behavior tree;
a pruning unit, configured to prune the abstract behavior tree and save the pruned abstract behavior tree structure;
The leaf node sequence detection unit includes:
a leaf conversion unit, configured to perform leaf node mapping through the abstract behavior tree, convert the captured system call sequence into a leaf node sequence, and extract features from the new leaf node sequence using the window frequency method;
a dimension reduction unit, configured to perform feature dimension reduction on the extracted feature vectors;
an output unit, configured to take the dimension-reduced feature vectors as the input of a machine learning model and classify the corresponding leaf node sequences as abnormal or normal.
Optionally, for the granularity unit, the granularity feature characterization modes include:
an original system call sequence feature representation, a system behavior feature representation, and a system kernel module feature representation, whose feature granularities are fine-grained, low-level coarse-grained, and high-level coarse-grained, respectively.
Optionally, the mapping unit being configured to store the mapping relations between the features of different granularities using a relation mapping graph includes:
the mapping relation between original system calls and system behaviors is many-to-one, the mapping relation between system behaviors and system kernel modules is many-to-one, and the mapping relations between the features of different granularities are stored in the relation mapping graph.
Optionally, the tree conversion unit being configured to convert the relation mapping graph into an abstract behavior tree includes:
converting the graph storage form of the relation mapping graph into a tree storage form, so that the mapping relations are stored in an abstract tree structure.
Optionally, the pruning unit being configured to prune the abstract behavior tree and save the pruned abstract behavior tree structure includes:
cutting off different leaf nodes in each round and measuring the effect of pruning by the accuracy of the model; when the accuracy reaches a preset threshold, the current abstract behavior tree is considered to meet the feature extraction requirement, and its structure is saved.
Optionally, the dimension reduction unit being configured to perform feature dimension reduction on the extracted feature vectors includes:
performing feature dimension reduction on the extracted feature vectors by singular value decomposition.
Optionally, the output unit being configured to take the dimension-reduced feature vectors as the input of a machine learning model and classify the corresponding leaf node sequences as abnormal or normal includes:
dividing the dimension-reduced feature vectors and classification labels into a training set, a test set, and a validation set, wherein the training set is used to train the model and determine its parameters, the test set is used to determine the network structure and adjust the hyperparameters of the model, and the validation set is used to check the generalization ability of the model; selecting different machine learning algorithm models and their parameters; and evaluating the model effect by cross-validation.
Compared with the prior art, the technical scheme of the invention has the following advantages:
The invention reduces the number of feature segments generated by feature extraction, reduces the dimension of the feature vectors generated from those segments, and reduces the time cost of feature extraction. An intrusion detection method built according to the invention saves the computing and storage resources required for deployment, lowers the minimum hardware requirements of the deployment host, resolves the tension between cost and performance, and can help customers implement an efficient, intelligent intrusion detection scheme on lower-specification hardware.
Drawings
The accompanying drawings are included to provide a further understanding of the invention, are incorporated in and constitute a part of this specification, and together with the embodiments serve to explain the invention.
FIG. 1 is a flow chart of the invention;
FIG. 2 is the implementation flow of the window frequency method;
FIG. 3 shows the mapping relations among the three granularity characterization modes;
FIG. 4 is a schematic diagram of pruning the abstract behavior tree.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail below with reference to the accompanying drawings and examples.
Example 1
As shown in FIG. 1, the present invention provides a host intrusion detection method based on a system call sequence, which mainly comprises a system call feature extraction stage and a leaf node sequence detection stage, wherein
The system call feature extraction stage is as follows:
S1, capturing system call information, dividing the captured system call information into a plurality of system call sequences, and marking the corresponding sequence labels.
In the original system call information capturing stage, system call information is captured by a dedicated system call capturing program; the captured system call log information is divided into a plurality of system call sequences, which are marked with the corresponding category labels.
Different operating systems provide a process tracing system call interface through which process tracing can be implemented. The parent process controls the child process and can change the child's core image, including reading and writing data in the child's address space. The basic principle of capturing original system call information is that, once process tracing is in use, all signals sent to the traced child process are forwarded to the parent process while the child process is blocked. After the parent process receives the signal, it can inspect and modify the stopped child process and then let it continue running. In this way the system call information can be captured; it is then processed into a plurality of original system call sequences, which are given the corresponding sequence labels to facilitate subsequent feature extraction and model training.
S2, defining the abnormal activity track represented by an abnormal sequence through feature characterization modes of different granularities.
Optionally, the granularity feature characterization modes include: an original system call sequence feature representation, a system behavior feature representation, and a system kernel module feature representation, whose feature granularities are fine-grained, low-level coarse-grained, and high-level coarse-grained, respectively.
As shown in FIG. 2, the feature granularities of the original system call sequence feature representation, the system behavior feature representation, and the system kernel module feature representation are fine-grained, low-level coarse-grained, and high-level coarse-grained, respectively. The abnormal activity track represented by an abnormal sequence is defined through these three feature representations of different granularity: more than one hundred system call numbers (fine granularity) are mapped into seventy system behaviors (low-level coarse granularity) according to the behavior each system call represents, and the seventy system behavior interfaces are mapped into seven kernel function sub-modules (high-level coarse granularity) according to the function of each system behavior. Each original system call sequence can therefore be converted into a system behavior sequence and a system kernel module sequence through the different granularity feature representations.
S3, storing the mapping relations between the features of different granularities using a relation mapping graph.
Optionally, storing the mapping relations between the features of different granularities using a relation mapping graph includes: the mapping relation between original system calls and system behaviors is many-to-one, the mapping relation between system behaviors and system kernel modules is many-to-one, and the mapping relations between the features of different granularities are stored in the relation mapping graph.
As shown in FIG. 3, the three granularity characterization modes have specific mapping relations with one another, and the relation mapping graph is used to represent the mapping relations among the three feature representations of different granularity. Intrusion behavior can be detected through the original system call sequence, the system behavior sequence, or the system kernel module sequence: the original system call sequence gives a good detection effect, the system behavior sequence gives a relatively poor detection effect, and the system kernel module sequence gives a poor detection effect, while the time cost and performance cost of the three sequences decrease in that order. To balance detection effect against resource cost, a multi-granularity mixed sequence is constructed from the three sequences.
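The two many-to-one mappings can be pictured as two lookup tables. The sketch below is only illustrative: the behavior labels reuse leaf node names that appear in the worked example later in this description, while `io-read`, `io-write`, and the module names are hypothetical stand-ins, since the full mapping table is not reproduced here.

```python
# Fine-grained system call number -> low-level coarse-grained system behavior.
# Hypothetical excerpt; several calls may share one behavior (many-to-one).
SYSCALL_TO_BEHAVIOR = {
    5: "fs-xattr", 6: "fs-xattr", 78: "fs-stat",
    125: "kernel-sched", 122: "kernel-sched", 91: "kernel-capability",
    192: "ipc-sem",
    3: "io-read", 4: "io-write",  # hypothetical behavior names
}

# Low-level behavior -> high-level coarse-grained kernel module (many-to-one).
BEHAVIOR_TO_MODULE = {
    "fs-xattr": "fs", "fs-stat": "fs",
    "kernel-sched": "kernel", "kernel-capability": "kernel",
    "ipc-sem": "ipc",
    "io-read": "io", "io-write": "io",
}


def to_behaviors(calls):
    """Convert a system call sequence into a system behavior sequence."""
    return [SYSCALL_TO_BEHAVIOR[c] for c in calls]


def to_modules(calls):
    """Convert a system call sequence into a system kernel module sequence."""
    return [BEHAVIOR_TO_MODULE[b] for b in to_behaviors(calls)]
```

With these tables, the same captured sequence can be viewed at any of the three granularities, which is what makes a mixed-granularity sequence possible.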
S4, converting the relation mapping graph into an abstract behavior tree.
Optionally, converting the relation mapping graph into an abstract behavior tree includes: converting the graph storage form of the relation mapping graph into a tree storage form, so that the mapping relations are stored in an abstract tree structure.
The mapping relations in the relation mapping graph are many-to-one, which is the same as the relation between child nodes and parent nodes in a tree structure, so the mapping relations are stored in a tree structure (named the abstract behavior tree) to facilitate subsequent adjustment of the tree structure.
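Because every mapping is many-to-one, each node has exactly one parent, and the graph can be stored as a tree with kernel modules below the root, behaviors below modules, and system calls as leaves. The following is a minimal sketch under that assumption; the `Node` class, the `root` node, and the mapping excerpts are hypothetical names introduced for illustration.

```python
class Node:
    """One node of the abstract behavior tree."""

    def __init__(self, name):
        self.name = name
        self.children = {}

    def child(self, name):
        # Create the child on first use, then reuse it.
        if name not in self.children:
            self.children[name] = Node(name)
        return self.children[name]

    def leaves(self):
        """Return the names of all leaf nodes below this node."""
        if not self.children:
            return [self.name]
        return [leaf for c in self.children.values() for leaf in c.leaves()]


def build_tree(syscall_to_behavior, behavior_to_module):
    """Build root -> module -> behavior -> system call from the two mappings."""
    root = Node("root")
    for call, behavior in syscall_to_behavior.items():
        module = behavior_to_module[behavior]
        root.child(module).child(behavior).child(call)
    return root


# Hypothetical excerpt of the mappings.
tree = build_tree(
    {5: "fs-xattr", 6: "fs-xattr", 78: "fs-stat", 125: "kernel-sched"},
    {"fs-xattr": "fs", "fs-stat": "fs", "kernel-sched": "kernel"},
)
```

Before any pruning, `tree.leaves()` contains only system call numbers, which corresponds to the purely fine-grained starting point.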
S5, pruning the abstract behavior tree, and saving the pruned abstract behavior tree structure.
Optionally, pruning the abstract behavior tree and saving the pruned abstract behavior tree structure includes: cutting off different leaf nodes in each round and measuring the effect of pruning by the accuracy of the model; when the accuracy reaches a preset threshold, the current abstract behavior tree is considered to meet the feature extraction requirement, and its structure is saved.
As shown in FIG. 4, the leaf nodes of the initial abstract behavior tree consist of fine-grained features. Pruning operations are performed on the abstract behavior tree, cutting off different leaf nodes each time; after pruning, the leaf nodes of the abstract behavior tree consist of feature representations of different granularities (fine-grained, low-level coarse-grained, and high-level coarse-grained). After several rounds of pruning, the leaf nodes of the finally retained abstract behavior tree consist of system call nodes, system behavior nodes, and system kernel module nodes. Each system call number corresponds to one leaf node of the abstract behavior tree, so every system call sequence composed of system call numbers can be converted into a new leaf node sequence through the tree.
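One way to picture pruning is to mark some inner nodes as collapsed: every system call below a collapsed node is then represented by that node instead of by itself. The sketch below assumes this reading. The search strategy (trying a list of candidate prunings in order), the `stub_accuracy` function, and its scores are all hypothetical; in practice the score would come from training and evaluating the detection model.

```python
def leaf_of(call, parent, collapsed):
    """Return the leaf representing `call`: its topmost collapsed ancestor, else itself."""
    node, leaf = call, call
    while node in parent:
        node = parent[node]
        if node in collapsed:
            leaf = node
    return leaf


def search_pruning(candidates, parent, evaluate, threshold):
    """Try candidate prunings in order; keep the first whose accuracy meets the threshold."""
    for collapsed in candidates:
        mapping = {c: leaf_of(c, parent, collapsed)
                   for c in parent if isinstance(c, int)}
        if evaluate(mapping) >= threshold:
            return collapsed, mapping
    return None, None


# Child -> parent links of a tiny hypothetical abstract behavior tree.
PARENT = {
    5: "fs-xattr", 6: "fs-xattr", 3: "io-read", 4: "io-write",
    "fs-xattr": "fs", "io-read": "io", "io-write": "io",
}


def stub_accuracy(mapping):
    # Stand-in for "train the model and measure accuracy"; the scores are made up.
    return 0.95 if len(set(mapping.values())) >= 3 else 0.60


chosen, mapping = search_pruning(
    [{"fs", "io"}, {"fs-xattr"}], PARENT, stub_accuracy, threshold=0.9
)
```

The saved `mapping` is all that the detection stage needs: it sends each system call number to the leaf node that represents it in the pruned tree.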
The leaf node sequence detection stage comprises the following steps:
S6, performing leaf node mapping through the abstract behavior tree, converting the captured system call sequence into a leaf node sequence, and extracting features from the new leaf node sequence using the window frequency method.
Compared with the original sequence, the leaf node sequence retains the information stored in the original sequence while greatly reducing the vector dimension generated during feature extraction, which significantly reduces the time cost and computation cost.
S7, performing feature dimension reduction on the extracted feature vectors.
Optionally, feature dimension reduction is performed on the extracted feature vectors by singular value decomposition.
The machine learning model places high demands on the feature dimension, so the extracted feature vectors need to be reduced in dimension. The singular value decomposition method is fast and effective, and can reduce the extracted feature vectors to a specified dimension.
S8, taking the dimension-reduced feature vectors as the input of a machine learning model, and classifying the corresponding leaf node sequences as abnormal or normal.
Optionally, the dimension-reduced feature vectors and classification labels are divided into a training set, a test set, and a validation set, wherein the training set is used to train the model and determine its parameters, the test set is used to determine the network structure and adjust the hyperparameters of the model, and the validation set is used to check the generalization ability of the model; finally, an efficient, low-cost intrusion detection engine is obtained.
The data is divided into a training set and a test set, different machine learning algorithm models are selected and their parameters chosen, the model effect is evaluated by cross-validation, and the parameters are tuned iteratively to build an intrusion detection engine with high accuracy and low cost. Finally, the engine is deployed on a host to perform intrusion detection on that host.
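The data handling in S8 can be sketched with the standard library alone. The split ratios (60/20/20), the fixed seed, and the fold count are assumptions chosen for illustration; the description does not state them.

```python
import random


def split_indices(n, train=0.6, test=0.2, seed=0):
    """Shuffle sample indices and split them into training, test and validation sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    a = int(n * train)
    b = a + int(n * test)
    return idx[:a], idx[a:b], idx[b:]


def k_fold(indices, k):
    """Yield (fit, held_out) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield fit, folds[i]


train_idx, test_idx, val_idx = split_indices(10)
```

Each candidate model and parameter setting would be fitted on `fit` and scored on `held_out` for every fold, and the averaged score used to compare candidates.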
A system call sequence is used below to illustrate the principle and process by which the invention reduces the number of feature segments.
Original system call sequence T: {5 125 6 5 3 6 91 4 78 78 78 125 122 192};
If T is taken as the whole corpus for feature extraction, features are extracted from the corpus using the window frequency method. The fixed length of the feature segments is set to K, and the number of distinct feature segments generated (the feature vector dimension) is N.
If K=1, the extracted feature segments are:
[5],[125],[6],[3],[91],[4],[78],[122],[192]; N=T(1)->9;
If K=2, the extracted feature segments are:
[5 125],[125 6],[6 5],[5 3],[3 6],[6 91],[91 4],[4 78],[78 78],[78 125],[125 122],[122 192]; N=T(2)->12;
If K=3, the extracted feature segments are:
[5 125 6],[125 6 5],[6 5 3],…,[125 122 192]; N=T(3)->12;
If K=4, the extracted feature segments are:
[5 125 6 5],[125 6 5 3],[6 5 3 6],…,[78 125 122 192]; N=T(4)->11;
If K=5, the extracted feature segments are:
[5 125 6 5 3],[125 6 5 3 6],[6 5 3 6 91],…,[78 78 125 122 192]; N=T(5)->10;
When K=1-5, the number of feature segments generated is: T(1-5)=T(1)+T(2)+T(3)+T(4)+T(5)=54;
When the method of the invention is adopted, the abstract behavior tree is used to convert the original system call sequence into the following leaf node sequence
L: {fs-xattr kernel-sched fs-xattr fs-xattr io fs-xattr kernel-capability io fs-stat fs-stat fs-stat kernel-sched kernel-sched ipc-sem};
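The conversion from T to L is a per-element lookup. In the sketch below, the lookup table `LEAF_OF` is read off by aligning T and L position by position, and T is written as the 14-call sequence implied by the listed 2-grams; both are reconstructions for illustration rather than the saved tree itself.

```python
# System call number -> leaf node of the pruned abstract behavior tree,
# obtained by aligning T and L element by element (an illustrative reconstruction).
LEAF_OF = {
    5: "fs-xattr", 6: "fs-xattr",
    3: "io", 4: "io",
    91: "kernel-capability",
    78: "fs-stat",
    125: "kernel-sched", 122: "kernel-sched",
    192: "ipc-sem",
}

T = [5, 125, 6, 5, 3, 6, 91, 4, 78, 78, 78, 125, 122, 192]

# Each system call is replaced by the leaf node that represents it.
L = [LEAF_OF[call] for call in T]
```

Nine distinct system call numbers collapse to six distinct leaf nodes, which is where the later reduction in distinct feature segments comes from.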
L is now taken as the whole corpus for feature extraction, and features are extracted from the corpus using the window frequency method. The fixed length of the feature segments is set to K, and the number of distinct feature segments generated is N.
If K=1, the extracted feature segments are:
[fs-xattr],[kernel-sched],[io],[kernel-capability],[fs-stat],[ipc-sem]; N=L(1)->6;
If K=2, the extracted feature segments are:
[fs-xattr kernel-sched],[kernel-sched fs-xattr],…,[kernel-sched ipc-sem]; N=L(2)->12;
If K=3, the extracted feature segments are:
[fs-xattr kernel-sched fs-xattr],…,[kernel-sched kernel-sched ipc-sem]; N=L(3)->12;
If K=4, the extracted feature segments are:
[fs-xattr kernel-sched fs-xattr fs-xattr],…,[fs-stat kernel-sched kernel-sched ipc-sem]; N=L(4)->11;
If K=5, the extracted feature segments are:
[fs-xattr kernel-sched fs-xattr fs-xattr io],…,[fs-stat fs-stat kernel-sched kernel-sched ipc-sem]; N=L(5)->10;
When K=1-5, the number of feature segments generated is: L(1-5)=L(1)+L(2)+L(3)+L(4)+L(5)=51;
From the above it can be seen that L(i)<=T(i) and L(i-j)<=T(i-j), where 0<i<j, i and j are positive integers, and L(i-j) denotes the total over fixed lengths i to j; this illustrates the principle of the invention.
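The counts in this example can be reproduced by collecting the distinct windows of each length. This is a small sketch: `distinct_fragments` is a hypothetical helper name, and T is written as the 14-call sequence implied by the listed 2-grams.

```python
def distinct_fragments(seq, k):
    """Return the set of distinct length-k windows in seq."""
    return {tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)}


T = [5, 125, 6, 5, 3, 6, 91, 4, 78, 78, 78, 125, 122, 192]
L = ["fs-xattr", "kernel-sched", "fs-xattr", "fs-xattr", "io", "fs-xattr",
     "kernel-capability", "io", "fs-stat", "fs-stat", "fs-stat",
     "kernel-sched", "kernel-sched", "ipc-sem"]

# Number of distinct feature segments for K = 1..5.
t_counts = [len(distinct_fragments(T, k)) for k in range(1, 6)]
l_counts = [len(distinct_fragments(L, k)) for k in range(1, 6)]  # [6, 12, 12, 11, 10]

# The leaf node sequence never yields more distinct segments than the original.
never_more = all(l <= t for l, t in zip(l_counts, t_counts))
```

On a sequence this short the saving shows up mainly at K=1; the gap at longer lengths only becomes visible on a corpus with many sequences.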
In a practical application scenario, the corpus is far larger than the one in this example and often consists of tens of thousands of original system call sequences, so the advantage of the invention becomes more evident as the corpus grows.
Therefore, the ADFA-LD data set is used as the corpus on which the advantage of the invention in reducing feature segments is evaluated.
The numbers of feature segments T(k) generated from the original system call sequences and L(k) generated from the leaf node sequences of the invention on the ADFA-LD corpus are shown in Table 1 below:
TABLE 1
Figure SMS_1
The above describes the feature segment extraction method of the present invention, and the vectorization method of the feature segment of the present invention is explained next;
Define the corpus composed of the selected leaf node sequences as $D = \{d_1, d_2, \ldots, d_N\}$. The $i$-th leaf node sequence in the corpus is defined as $d_i = (b_1, b_2, \ldots, b_m; y_i)$, where $b_j$ represents a feature segment contained in the leaf node sequence and $y_i$ is the label corresponding to the leaf node sequence (normal sequence or malicious sequence): one label value indicates that the sequence is a normal sequence and the other indicates that it is a malicious sequence. $N$ represents the number of leaf node sequences in the corpus.
The present invention uses the tf-idf technique to convert feature segments into vectors, an input format suitable for the various classifier models.
The tf-idf value of a feature segment is calculated as follows.
The term frequency is calculated by formula (1), where $\mathrm{tf}_i$ is the term frequency of the $i$-th feature segment $b_i$, $n_i$ is the number of times the feature segment $b_i$ appears in the whole corpus, and $\sum_k n_k$ is the total number of all feature segments contained in the corpus:
$$\mathrm{tf}_i = \frac{n_i}{\sum_k n_k} \qquad (1)$$
The inverse document frequency is calculated by formula (2), where $\mathrm{idf}_i$ is the inverse document frequency of the $i$-th feature segment $b_i$, $|D|$ is the total number of leaf node sequences contained in the corpus, and $|\{j : b_i \in d_j\}|$ is the number of leaf node sequences containing the feature segment $b_i$. The added 1 avoids a denominator of 0 (which occurs when a feature segment in the test set does not appear in the corpus of the training set):
$$\mathrm{idf}_i = \log \frac{|D|}{1 + |\{j : b_i \in d_j\}|} \qquad (2)$$
Therefore, tf-idf is calculated by formula (3): the tf-idf value of the feature segment $b_i$ equals the product of its term frequency $\mathrm{tf}_i$ and its inverse document frequency $\mathrm{idf}_i$:
$$\mathrm{tfidf}_i = \mathrm{tf}_i \times \mathrm{idf}_i \qquad (3)$$
Denoting the tf-idf value of $b_i$ by $w_i$, a leaf node sequence containing the feature segments $(b_1, b_2, \ldots, b_m)$ is converted to the vector representation $(w_1, w_2, \ldots, w_m)$.
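The tf-idf computation described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation of the invention: the term frequency is counted over the whole corpus and the inverse document frequency adds 1 to the denominator, as the text describes, while the function names, the natural logarithm, the fixed vocabulary ordering and the choice to write 0 for segments absent from a sequence are assumptions made here.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Segment = Tuple[str, ...]


def ngrams(seq: Sequence[str], n: int) -> List[Segment]:
    """Sliding-window feature segments of fixed length n."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def fit_tfidf(corpus: Sequence[Sequence[str]], n: int) -> Dict[Segment, float]:
    """Return the tf-idf weight of every feature segment in the corpus."""
    counts: Counter = Counter()    # occurrences of each segment in the corpus
    doc_freq: Counter = Counter()  # number of sequences containing the segment
    for seq in corpus:
        grams = ngrams(seq, n)
        counts.update(grams)
        doc_freq.update(set(grams))
    total = sum(counts.values())   # total number of segments in the corpus
    n_docs = len(corpus)
    return {
        seg: (counts[seg] / total) * math.log(n_docs / (1 + doc_freq[seg]))
        for seg in counts
    }


def vectorize(seq: Sequence[str], weights: Dict[Segment, float],
              vocab: Sequence[Segment], n: int) -> List[float]:
    """Vector over the vocabulary: weight if the segment occurs, else 0 (assumption)."""
    present = set(ngrams(seq, n))
    return [weights[seg] if seg in present else 0.0 for seg in vocab]


# Toy leaf node sequences (hypothetical leaf node names).
leaf_corpus = [
    ["file_io", "file_io", "net"],
    ["net", "net", "file_io"],
    ["process", "file_io", "file_io"],
    ["process", "process", "net"],
]
weights = fit_tfidf(leaf_corpus, n=2)
vocab = sorted(weights)
vectors = [vectorize(seq, weights, vocab, n=2) for seq in leaf_corpus]
```

Each row of `vectors` has one entry per distinct segment in the corpus, which is why the vector dimension grows with the number of feature segments.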
In general, the transformed feature vector may have a higher dimension when the fixed length value of the feature segment is set to be larger or it is desired to include a plurality of feature segments of different lengths.
To reduce the dimension of the feature vectors more quickly, the invention uses singular value decomposition (SVD) for dimension reduction, because SVD is computationally more efficient than principal component analysis.
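As an illustration of SVD-based dimension reduction, the sketch below projects feature vectors onto their top-k right singular vectors. It is a pure-Python approximation using power iteration with deflation on the Gram matrix, chosen here only to keep the example dependency-free; it is not the routine used by the invention, and a real deployment would more likely call a library SVD. The function names and the assumption that k does not exceed the rank of the matrix are introduced here.

```python
import math
from typing import List

Matrix = List[List[float]]


def _dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_right_singular_vectors(x: Matrix, k: int, iters: int = 200) -> Matrix:
    """Approximate the top-k right singular vectors of x (assumes k <= rank of x)."""
    d = len(x[0])
    # Gram matrix X^T X; its eigenvectors are the right singular vectors of X.
    gram = [[sum(row[i] * row[j] for row in x) for j in range(d)] for i in range(d)]
    found: Matrix = []
    for r in range(k):
        v = [1.0 / (i + 1 + r) for i in range(d)]  # deterministic start vector
        for _ in range(iters):
            w = [_dot(g_row, v) for g_row in gram]
            for u in found:  # deflation: remove directions already found
                proj = _dot(w, u)
                w = [a - proj * b for a, b in zip(w, u)]
            norm = math.sqrt(_dot(w, w))
            if norm == 0.0:
                break
            v = [a / norm for a in w]
        found.append(v)
    return found


def reduce_dimension(x: Matrix, k: int) -> Matrix:
    """Project each row of x onto the top-k right singular vectors."""
    basis = top_right_singular_vectors(x, k)
    return [[_dot(row, v) for v in basis] for row in x]


# Four 3-dimensional feature vectors reduced to 2 dimensions.
features = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0], [3.0, 0.0, 0.0]]
reduced = reduce_dimension(features, k=2)
```

The direction with the smallest singular value is discarded, so the third sample, which lies entirely along that direction, is mapped close to the origin.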
The dimension-reduced feature vectors are then used as the input of several machine learning models (four machine learning classification models), which finally classify the corresponding leaf node sequences into two classes, abnormal and normal.
Compared with the prior art, which extracts features directly from the original system call sequence by the window frequency method, the method first maps the original system call sequence to a leaf node sequence and then extracts feature segments from the leaf node sequence. This markedly slows the growth in the number of feature segments, reduces the time spent on feature extraction, and also improves the accuracy of the resulting intrusion detection engine to some extent. Performance was evaluated on the ADFA-LD data set. When the fixed length n of the feature segments extracted by the window method is set to 3, the original system call sequences and the leaf node sequences generate 18316 and 8632 feature segments respectively, so the original sequences generate 112.19% more feature segments than the leaf node sequences (equivalently, the leaf node sequences generate 52.87% fewer). When the length is set to 1-5 (that is, counting all feature segments of fixed length 1 to 5), the leaf node sequences and the original system call sequences generate 135485 and 160035 feature segments respectively, a reduction of 15.34% relative to the original system call sequences. The performance of intrusion detection engines built with machine learning models such as SVM was evaluated comprehensively with four indicators: precision, recall, F1 score and false alarm rate. With the fixed length set to 1 to 5 in turn, 20 index values are obtained (four indicators at each of five lengths); compared with the corresponding values of the detection engine built with the prior art, 16 of them are better. In addition, the average feature extraction time is reduced by 1.02s, 6.0256 s, 32 140.43s and 3723 s.
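The segment counts reported for ADFA-LD and the four evaluation indicators can be checked with a few lines of Python. This sketch assumes, consistently with L(k) ≤ T(k), that the larger count in each pair belongs to the original system call sequences. The indicator formulas are the usual confusion-matrix definitions, with the false alarm rate taken as FP / (FP + TN); the source does not spell these out, so they are assumptions, as are the function names and the example confusion-matrix counts.

```python
from typing import Dict


def relative_reduction(original: int, reduced: int) -> float:
    """Fraction by which `reduced` is smaller than `original`."""
    return (original - reduced) / original


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Precision, recall, F1 score and false alarm rate from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    false_alarm_rate = fp / (fp + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_alarm_rate": false_alarm_rate,
    }


# Reported counts: fixed length 3, and all lengths 1 to 5.
reduction_n3 = relative_reduction(18316, 8632)
reduction_n1_5 = relative_reduction(160035, 135485)
excess_n3 = (18316 - 8632) / 8632  # original count relative to the leaf node count

print(f"n=3:   {reduction_n3:.2%} fewer segments ({excess_n3:.2%} more in the original)")
print(f"n=1-5: {reduction_n1_5:.2%} fewer segments")

# Hypothetical confusion-matrix counts, for illustration only.
metrics = detection_metrics(tp=8, fp=2, tn=88, fn=2)
```

The two percentages for n = 3 describe the same pair of counts from different baselines: 52.87% is relative to the original count, while 112.19% is relative to the leaf node count.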
Compared with the prior art, the method reduces the number of feature segments generated by feature extraction, lowers the dimension of the feature vectors generated from them, and reduces the time cost of feature extraction. This saves the computing and storage resources required for deployment, lowers the minimum hardware requirements of the deployment host, resolves the conflict between cost and performance, and helps customers implement an efficient and intelligent intrusion detection scheme on lower-end configurations.
The foregoing has shown and described the basic principles and main features of the present invention and the advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims.

Claims (5)

1. A method for intrusion detection based on system call, characterized by comprising the following steps:
1) a system call feature extraction stage:
S1: capturing system call information, dividing the captured system call information into a plurality of system call sequences, and marking the corresponding sequence labels;
S2: defining the abnormal activity tracks represented by abnormal sequences through feature representations of different granularities;
the feature representations of different granularities comprise:
an original system call sequence feature representation, a system behavior feature representation and a system kernel module feature representation, whose feature granularities are a fine-grained representation, a low-level coarse-grained representation and a high-level coarse-grained representation, respectively;
S3: storing the mapping relations between the features of different granularities by using a relation mapping diagram, wherein:
the mapping relation between original system calls and system behaviors is many-to-one, the mapping relation between system behaviors and system kernel modules is many-to-one, and the mapping relations between the features of different granularities are stored in the relation mapping diagram;
S4: converting the relation mapping diagram into an abstract behavior tree;
S5: pruning the abstract behavior tree, and storing the structure of the pruned abstract behavior tree;
2) a leaf node sequence detection stage:
S6: mapping to leaf nodes through the abstract behavior tree, converting a captured system call sequence into a leaf node sequence, and extracting features from the new leaf node sequence by using a window frequency method;
S7: performing feature dimension reduction on the extracted feature vectors;
S8: taking the dimension-reduced feature vectors as the input of a machine learning model, and classifying the corresponding leaf node sequences into two classes, abnormal and normal.
2. The method of intrusion detection based on system call according to claim 1, wherein step S4, converting the relation mapping diagram into an abstract behavior tree, comprises:
converting the graph storage form of the relation mapping diagram into a tree storage form, and storing the relation mapping diagram using an abstract tree structure.
3. The method of intrusion detection based on system call according to claim 1, wherein step S5, pruning the abstract behavior tree and storing the structure of the pruned abstract behavior tree, comprises:
cutting off different leaf nodes each time and measuring the pruning effect by the accuracy of the model; when the accuracy reaches a preset threshold, the current abstract behavior tree is considered to meet the feature extraction requirement, and the current abstract behavior tree structure is stored.
4. The method of intrusion detection based on system call according to claim 1, wherein step S7, performing feature dimension reduction on the extracted feature vectors, comprises:
performing feature dimension reduction on the extracted feature vectors through singular value decomposition.
5. The method of intrusion detection based on system call according to claim 1, wherein step S8, taking the dimension-reduced feature vectors as the input of a machine learning model and classifying the corresponding leaf node sequences into two classes, abnormal and normal, comprises:
dividing the dimension-reduced feature vectors and the classification labels into a training set, a test set and a validation set, wherein the training set is used for training the model and determining parameters, the test set is used for determining the network structure and adjusting the hyperparameters of the model, and the validation set is used for checking the generalization capability of the model; selecting different machine learning algorithm models for parameter selection, and evaluating the model effect using cross-validation.
CN202310072261.3A 2023-02-07 2023-02-07 Host intrusion detection method based on system call sequence Active CN115859277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310072261.3A CN115859277B (en) 2023-02-07 2023-02-07 Host intrusion detection method based on system call sequence

Publications (2)

Publication Number Publication Date
CN115859277A CN115859277A (en) 2023-03-28
CN115859277B true CN115859277B (en) 2023-05-02

Family

ID=85657673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310072261.3A Active CN115859277B (en) 2023-02-07 2023-02-07 Host intrusion detection method based on system call sequence

Country Status (1)

Country Link
CN (1) CN115859277B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298381A (en) * 2019-05-24 2019-10-01 中山大学 A kind of cloud security service functional tree Network Intrusion Detection System
CN115278752A (en) * 2022-06-10 2022-11-01 广州大学 AI (Artificial intelligence) detection method for abnormal logs of 5G (fifth generation) communication system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7162741B2 (en) * 2001-07-30 2007-01-09 The Trustees Of Columbia University In The City Of New York System and methods for intrusion detection with dynamic window sizes
CN102546638B (en) * 2012-01-12 2014-07-09 冶金自动化研究设计院 Scene-based hybrid invasion detection method and system
CN110597735B (en) * 2019-09-25 2021-03-05 北京航空航天大学 Software defect prediction method for open-source software defect feature deep learning
CN111563234A (en) * 2020-04-23 2020-08-21 华南理工大学 Feature extraction method of system call data in host anomaly detection
CN112134862B (en) * 2020-09-11 2023-09-08 国网电力科学研究院有限公司 Coarse-fine granularity hybrid network anomaly detection method and device based on machine learning
CN111931175B (en) * 2020-09-23 2020-12-25 四川大学 Industrial control system intrusion detection method based on small sample learning
CN112613032B (en) * 2020-12-15 2024-03-26 中国科学院信息工程研究所 Host intrusion detection method and device based on system call sequence
CN113094713B (en) * 2021-06-09 2021-08-13 四川大学 Self-adaptive host intrusion detection sequence feature extraction method and system
CN114816909B (en) * 2022-04-13 2024-03-26 北京计算机技术及应用研究所 Real-time log detection early warning method and system based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant