CN113378156A - Malicious file detection method and system based on API - Google Patents
- Publication number
- CN113378156A (application number CN202110749396.XA)
- Authority
- CN
- China
- Prior art keywords
- file
- api
- word
- files
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/52—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity ; Preventing unwanted data erasure; Buffer overflow
- G06F21/53—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity ; Preventing unwanted data erasure; Buffer overflow by executing in a restricted environment, e.g. sandbox or secure virtual machine
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/448—Execution paradigms, e.g. implementations of programming paradigms
- G06F9/4488—Object-oriented
- G06F9/449—Object-oriented method invocation or resolution
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Hardware Design (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention provides an API (application program interface)-based malicious file detection method and system. The method comprises the following steps: running a file in a sandbox while recording the name of each API called at run time, the thread ID (tid), and the sequence number (index) of each API call within its thread; data preprocessing, comprising processing the API names, optimizing low-frequency APIs, generating new fields, and label-encoding the file categories; constructing a feature engineering set from the processed data, comprising global features and local combined features that are finally spliced into a single feature set; based on the initial training result of the model, relabeling some of the files that antivirus software could not classify as normal records, and then retraining the model; and model prediction. The invention also provides an API-based malicious file detection system. The method achieves a certain recognition rate for various malicious files that bypass feature-code and sandbox detection, and can improve the generalization capability of malicious file detection.
Description
Technical Field
The invention relates to the technical field of information security services, in particular to a malicious file detection method and system based on an API.
Background
In recent years, with the development of computer technology and the wide application of intelligent terminals and network technology, malicious files have continued to spread and mutate. Legitimate files strengthen and extend the capabilities of a computer and thereby make work and daily life more convenient, whereas malicious files are intended to steal or destroy computer data and can cause economic loss and distress to enterprises and individuals. Detecting malicious files in time, blocking the threats they pose, and keeping the network environment healthy and safe are therefore increasingly important.
Current malicious file detection methods mainly include the feature code method and sandbox detection. The feature code method is the most common: a virus feature code library is established and maintained, and a file is judged malicious if it contains one of the feature codes. This method is fast, but it cannot detect malicious files whose feature codes are unknown, and a malicious file can evade the feature codes once it has been transformed, encrypted, or packed. In recent years sandbox technology has been used more and more widely. It runs an unknown file in a simulated normal environment, records the actions the file performs, and matches those actions against a malicious file library to judge whether the unknown file is malicious. With the spread of machine learning, methods that build machine learning models to detect malicious files have also appeared.
For example, patent application No. 202010572487.6 discloses a method for constructing a malicious file detection model and detecting malicious files. A plurality of normal samples and a plurality of malicious samples are obtained and labeled respectively, and the unpacked malicious samples are filtered out. A static model is established as follows: the PE formats of the normal and malicious samples are obtained; each sample's PE format is converted into a plurality of feature vectors; the feature vectors are combined and associated with labels; the random forest model and the LightGBM model are tuned to their optimal parameters; and the labeled feature vectors are input into the random forest model and the LightGBM model to establish a random forest model and a LightGBM model for static detection of malicious files. A dynamic model is established as follows: the normal and malicious samples are placed in a sandbox to obtain a sandbox report, and feature vectors relating to the API, tid, return_value, and index are extracted from the report; the feature vectors are combined and associated with labels; the random forest model and the LightGBM model are tuned to their optimal parameters, and an important-feature random forest model is established; and the labeled feature vectors are input into the random forest model, the important-feature random forest model, and the LightGBM model to establish the corresponding models for dynamic detection of malicious files. All static and dynamic models are then fused into a fusion model. A final malicious score is calculated from the total suspicion score produced by the fusion model and the suspicion score produced by the malheur model, and the sample is judged according to that final score.
Although sandbox detection partly overcomes the inability of the feature code method to detect unknown malicious files, attackers keep looking for ways to bypass the sandbox, such as probing system characteristics or delaying operations, so that the sandbox cannot determine whether some unknown files are malicious.
Although existing machine learning methods partly solve the problem of sandbox detection being bypassed, their feature engineering is biased toward statistical features. They ignore the differences between files in the time sequence of API calls, and they ignore feature sparsity during feature processing, which may reduce the accuracy and efficiency of the model.
Disclosure of Invention
The technical problem to be solved by the invention is how to judge whether the unknown file belongs to a malicious file.
The invention solves the technical problems through the following technical means: a malicious file detection method based on API comprises the following steps:
S101, classifying collected files to confirm file types, putting the files into a sandbox for operation, and recording parameters called when the files operate; the file category comprises known files and unknown files; the parameters include: file ID, API name, thread ID and API call sequence number;
S102, preprocessing is carried out based on parameters called when a known file runs to serve as model training data;
S103, constructing a feature engineering set based on the preprocessed data, wherein the feature engineering set comprises: global features and local combined features;
S104, constructing a model, and correcting the model based on the feature engineering set and a preset threshold value;
S105, detecting the collected unknown file based on the corrected model so as to confirm whether the unknown file is a malicious file.
The method focuses on processing the API names: keywords are extracted from each API name and features are constructed from them, which reduces feature dimensionality and feature sparsity and improves the efficiency and accuracy of the model. Through model detection, the probability that an unknown file belongs to each category can be output, which quantifies the likelihood that the file is malicious as a score. In addition, when no file labeled 'normal' is available, the model is used to generate a pseudo-labeled 'normal' data set, and a multi-classification model with some ability to recognize 'normal' files is trained, so that both whether an unknown file is malicious and which category it belongs to can be predicted.
As an optimized technical solution, in the step S102, the step of preprocessing based on the parameter called by the known file runtime includes:
segmenting the API name to obtain a first word and a second word, and filling a first column and a second column corresponding to the API name based on the first word;
merging the API names based on the contents of the first column to reduce feature dimensions;
optimizing the API based on the number of the files corresponding to the first column;
generating a new field based on the thread ID and the sequence number of the API call in the thread;
and converting the file type into a numerical value, and finishing label coding mapping.
As an optimized technical solution, the step of segmenting the API name to obtain a first word and a second word, and the step of filling a first column and a second column corresponding to the API name based on the first word includes:
segmenting the API name by regular expression matching according to the upper camel case naming convention of the API to obtain the first word and the second word in the API name;
the first word is populated in a first column corresponding to the API name and the second word is populated in a second column corresponding to the API name.
As an optimized technical solution, the step of generating a new field based on the thread ID and the sequence number of the API call in the thread includes:
generating a first field from the difference between the sequence number of the API call and the thread ID;
calculating, with the file name and the thread ID as grouping objects, a first difference between each pair of consecutive API call sequence numbers;
concatenating the contents of two adjacent first words corresponding to the same thread ID to generate a second field.
The step S103 of constructing a feature engineering set based on the preprocessed data includes:
taking the thread ID as a grouping object to count the global features;
taking a thread ID and a preset field as grouping objects to count the local combination characteristics;
and splicing the global features and the local combined features into the feature engineering set by taking the thread ID as a main key.
As an optimized technical solution, the step of taking the thread ID as a grouping object to count the global features includes:
taking the thread ID as a grouping object, counting the number of occurrences of the first word, the number of distinct first words, and the number of distinct second words;
taking the thread ID as a grouping object, and counting the maximum value, the minimum value, the mean value, the median, the standard deviation, the number of distinct values, the dispersion, the coefficient of variation and the deviation of the median from the mean value of the thread ID;
taking the thread ID as a grouping object, and counting the maximum value, the minimum value, the mean value, the median, the standard deviation, the number of distinct values, the dispersion, the coefficient of variation and the deviation of the median from the mean value of the sequence number called by the API;
taking the thread ID as a grouping object, and counting the maximum value, the minimum value, the mean value, the median, the standard deviation, the number of distinct values, the dispersion, the coefficient of variation and the deviation of the median from the mean value of the first field;
and taking the thread ID as a grouping object, and counting the number of occurrences of the second field and the number of distinct values of the second field.
As an optimized technical scheme, the step of taking the thread ID and the preset field as grouping objects to count the local combination features comprises the following steps:
taking the thread ID and the first word as grouping objects, and counting the occurrence frequency of each second word and the occurrence frequency after the duplication is eliminated;
taking the thread ID and the first word as grouping objects, and counting the maximum value, the minimum value, the median, the standard deviation and the times after the duplication elimination of each first difference value;
and taking the thread ID and the second field as grouping objects, and counting the occurrence times of each second word.
As an optimized technical solution, in the step S104, the step of modifying the model based on the feature engineering set and the preset threshold includes:
taking the characteristic engineering set and the file types corresponding to the files as the input of a model to carry out iterative learning, and outputting the probability of the file type corresponding to each file ID;
for each file whose maximum class probability is smaller than a preset threshold value and whose original file type is unknown, changing the file type to the pseudo label 'normal' to form a new data set;
and performing iterative learning for preset times by taking the new data set as the input of the model to finish the correction of the model.
As an optimized technical solution, in step S101, the step of classifying the collected files includes: and scanning the collected files through antivirus software, and confirming the file types according to the scanning results.
The invention also provides a malicious file detection system based on the API, which comprises:
the parameter confirmation module is used for classifying the collected files to confirm the file types, putting the files into a sandbox for operation, and recording parameters called when the files operate; the file category comprises known files and unknown files; the parameters include: file ID, API name, thread ID and API call sequence number;
the preprocessing module is used for preprocessing parameters called during the operation of the known file to serve as model training data;
the feature construction module is used for constructing a feature engineering set based on the preprocessed data, and the feature engineering set comprises: global features and local combined features;
the correction module is used for constructing a model and correcting the model based on the characteristic engineering set and a preset threshold;
and the detection module is used for detecting the acquired unknown file based on the corrected model so as to confirm whether the unknown file is a malicious file.
The invention has the following advantages: it provides a malicious file detection method based on an Application Program Interface (API), in which a feature engineering set is constructed from the APIs called and the thread IDs (tid) recorded while a file runs, a classification model is trained, and the model judges whether an unknown file is malicious.
In addition, the method focuses on processing the API names: keywords are extracted from each API name and features are constructed from them, which reduces feature dimensionality and feature sparsity and improves the efficiency and accuracy of the model. Through model prediction, the probability that an unknown file belongs to each category can be output, which quantifies the likelihood that the file is malicious as a score. Furthermore, when no file labeled 'normal' is available, the model is used to generate a pseudo-labeled 'normal' data set, and a multi-classification model with some ability to recognize 'normal' files is trained, so that both whether an unknown file is malicious and which category it belongs to can be predicted.
Drawings
Fig. 1 is a general flowchart of a malicious file detection method based on API in embodiment 1 of the present invention.
Fig. 2 is a block diagram of a malicious file detection system based on API in embodiment 2 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As described in the background, existing methods for detecting malicious files all have shortcomings to some extent. Starting from actual service requirements, the method addresses these shortcomings by constructing 2-gram combinations of APIs, splitting and merging API names, and merging low-frequency APIs. This avoids feature sparsity, reduces feature dimensionality, and extends the API features along the calling time sequence, thereby improving the accuracy and efficiency of the model. Furthermore, in a real environment many unlabeled samples are available but labeled samples are limited. The method also solves the problem that a classification model able to predict the 'normal' category cannot be built when there are many unlabeled files but no files labeled 'normal': 'normal' records are generated with the model in the form of pseudo labels.
Embodiment 1
Referring to fig. 1, the present invention provides a malicious file detection method based on API, which specifically includes the following steps:
S101, classifying collected files to confirm file types, putting the files into a sandbox for operation, and recording parameters called when the files operate;
the file category comprises known files and unknown files;
the parameters include: the API (application program interface) name, the thread ID (tid), and the sequence number (index) of the API call within the thread. While a file runs, it generally calls many APIs across several threads. There is no ordering relationship between different tids; within the same tid, ascending index values give the order in which the calls were made, although the values may not be consecutive.
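The record layout and the per-thread ordering described above can be sketched as follows. This is a minimal illustration only: the `ApiCall` type, the `calls_by_thread` helper, and the sample values are hypothetical and are not taken from the patent.

```python
from collections import defaultdict
from typing import NamedTuple


class ApiCall(NamedTuple):
    """One sandbox record (hypothetical layout): file ID, API name, tid, index."""
    file_id: int
    api: str
    tid: int
    index: int


def calls_by_thread(calls):
    """Group records by (file_id, tid) and order each group by index.

    Different tids are kept apart because they have no ordering relationship;
    within a tid, ascending index gives the call order (gaps are allowed).
    """
    groups = defaultdict(list)
    for call in calls:
        groups[(call.file_id, call.tid)].append(call)
    return {key: sorted(group, key=lambda c: c.index)
            for key, group in groups.items()}


calls = [
    ApiCall(1, "CreateFileW", 2488, 5),
    ApiCall(1, "LdrLoadDll", 2488, 0),
    ApiCall(1, "RegOpenKeyExW", 2520, 1),
    ApiCall(1, "WriteFile", 2488, 9),
]
ordered = calls_by_thread(calls)
```

Keeping the grouping key as the pair (file ID, tid) means that identical tid values observed in different files are never mixed.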
Wherein the step of classifying the collected files comprises: and scanning the collected files through antivirus software, and confirming the file types according to the scanning results.
S102, preprocessing is carried out based on parameters called when a known file runs to serve as model training data;
the step of preprocessing based on parameters called by the known file runtime comprises:
S1021, the API name is segmented to obtain a first word and a second word, and a first column and a second column corresponding to the API name are filled based on the first word and the second word:
S10211, according to the upper camel case naming convention of the API, the API name is segmented by regular expression matching to obtain the first word and the second word in the API name. For example, after 'CreateFileW' is segmented, the first word is 'Create' and the second word is 'File';
S10212, the first word is populated in a first column corresponding to the API name and the second word is populated in a second column corresponding to the API name.
S1022, merging the API names based on the content of the first word to reduce feature dimensions;
For an API name that contains no capital letters, the first word is set to the whole API name and the second word is left empty, so that both columns are populated for every API. With this processing, APIs with the same or similar functions, such as 'CreateFileW' and 'CreateFileA', are merged, which reduces feature dimensions.
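As an illustration of the splitting and merging just described, the following sketch splits an API name on upper camel case boundaries with a regular expression. The pattern and the `split_api` helper are assumptions made for this example; the patent does not give the actual expression.

```python
import re

# Assumed pattern: a capital followed by lowercase letters/digits ("Create"),
# or a run of capitals not followed by a lowercase letter ("W", "WSA").
_WORD = re.compile(r"[A-Z][a-z0-9]+|[A-Z]+(?![a-z])")


def split_api(name):
    """Return (first_word, second_word) for an API name.

    Names without any capital letter keep the whole name as the first word
    and get an empty second word, as described in step S1022.
    """
    words = _WORD.findall(name)
    if not words:
        return name, ""
    second = words[1] if len(words) > 1 else ""
    return words[0], second


# 'CreateFileW' and 'CreateFileA' collapse to the same pair of words.
merged = {split_api(n) for n in ("CreateFileW", "CreateFileA")}
```

Because the trailing 'W' or 'A' suffix becomes a third word that is discarded, the wide-character and ANSI variants of an API fall into the same (first word, second word) pair.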
S1023, performing low-frequency API optimization, namely optimizing the APIs based on the number of files corresponding to each value in the first column;
S1024, generating new fields based on the thread ID and the sequence number of the API call in the thread, comprising the following steps:
generating a first field from the difference between the sequence number (index) of the API call and the thread ID (tid);
taking the file ID (file_id) and the thread ID as grouping objects, calculating a first difference (index_diff) between each pair of consecutive API call sequence numbers;
concatenating the contents of two adjacent first words corresponding to the same file ID to generate a second field (API_2N).
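The three derived fields can be sketched as below. Several details are assumptions made for this example rather than statements of the patent: the first field is taken literally as `index - tid` and named `inx_tid` (a name inferred from the later feature names), adjacency for the 2-gram is evaluated within one file and one thread in index order, and the two first words are joined with an underscore.

```python
from collections import defaultdict


def add_derived_fields(rows):
    """rows: list of dicts with keys file_id, tid, index, firstword.

    Returns new dicts with three extra keys:
      inx_tid    - assumed first field, index minus tid
      index_diff - difference between consecutive index values in a thread
      api_2N     - 2-gram of adjacent first words (None for the first call)
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["file_id"], row["tid"])].append(row)

    out = []
    for key in sorted(groups):
        prev = None
        for row in sorted(groups[key], key=lambda r: r["index"]):
            new = dict(row)
            new["inx_tid"] = row["index"] - row["tid"]
            if prev is None:
                new["index_diff"] = None
                new["api_2N"] = None
            else:
                new["index_diff"] = row["index"] - prev["index"]
                new["api_2N"] = prev["firstword"] + "_" + row["firstword"]
            out.append(new)
            prev = row
    return out


rows = [
    {"file_id": 1, "tid": 10, "index": 3, "firstword": "Create"},
    {"file_id": 1, "tid": 10, "index": 0, "firstword": "Ldr"},
    {"file_id": 1, "tid": 10, "index": 7, "firstword": "Write"},
]
derived = add_derived_fields(rows)
```

The first call of each thread has no predecessor, so its difference and 2-gram are left as `None` here; how such rows are treated in practice is not specified in the text.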
And S1025, converting the file type into a numerical value, and finishing label coding mapping.
For example, "trojan" is mapped to a value of 0, "worm virus" is mapped to a value of 1, "malicious web file" is mapped to a value of 2, "unknown" is mapped to a value of 3, and so on.
The label (file category) includes but is not limited to the following categories: trojan, worm virus, macro virus document, downloader, virus program, malicious web file, suspicious program, backdoor program, game/smile file, unknown and the like, wherein the unknown means that the antivirus software cannot judge whether the file is a malicious file or not, but does not represent that the file is a normal file.
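A label-encoding step of this kind can be sketched as follows. The `build_label_map` helper is hypothetical, and assigning codes in order of first appearance is an assumption chosen so that the result reproduces the example mapping given above.

```python
def build_label_map(categories):
    """Assign consecutive integer codes to categories in order of first appearance."""
    mapping = {}
    for category in categories:
        if category not in mapping:
            mapping[category] = len(mapping)
    return mapping


label_map = build_label_map(
    ["trojan", "worm virus", "malicious web file", "unknown", "trojan"]
)
encoded = [label_map[c] for c in ["unknown", "trojan", "worm virus"]]
```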
The upper camel case naming convention means that a variable name or function name is formed by joining one or more words, with the initial letter of each word capitalized, as in 'CreateFileW'.
S103, feature engineering, namely constructing the feature engineering set based on the data processed in step S102. The set mainly consists of two parts, global features and local combined features, and is constructed as follows:
S1031, taking the file ID (file_id) as a grouping object to count the global features, which mainly comprise the following parts:
taking the file ID as a grouping object, counting the number of occurrences of the first word (firstword) (fileid_API1_count) and the number of distinct first words (fileid_API1_nunique), and counting the number of distinct second words (secondword) (fileid_API2_nunique);
taking the file ID as a grouping object, counting the maximum value (fileid_tid_max), the minimum value (fileid_tid_min), the mean value (fileid_tid_mean), the median (fileid_tid_median), the standard deviation (fileid_tid_std), the number of distinct values (fileid_tid_unique), the dispersion (fileid_tid_dis), the coefficient of variation (fileid_tid_cv) and the deviation of the median from the mean value (fileid_tid_sk) of the thread ID (tid);
taking the file ID as a grouping object, counting the maximum value (fileid_index_max), the minimum value (fileid_index_min), the mean value (fileid_index_mean), the median (fileid_index_median), the standard deviation (fileid_index_std), the number of distinct values (fileid_index_unique), the dispersion (fileid_index_dis), the coefficient of variation (fileid_index_cv) and the deviation of the median from the mean value (fileid_index_sk) of the sequence number (index) of the API call;
taking the file ID as a grouping object, counting the maximum value (file_inx_tid_max), the minimum value (file_inx_tid_min), the mean value (file_inx_tid_mean), the median (file_inx_tid_median), the standard deviation (file_inx_tid_std), the number of distinct values (file_inx_tid_unique), the dispersion (file_inx_tid_dis), the coefficient of variation (file_inx_tid_cv) and the deviation of the median from the mean value (file_inx_tid_sk) of the first field;
taking the file ID as a grouping object, counting the number of occurrences of the second field (API_2N) (file_API_2N_count) and the number of distinct values of the second field (file_API_2N_nunique).
S1032, taking the combination of the file ID (file_id) and a preset field as the grouping object to count the local combined features, then, with the file ID as the primary key, pivoting the values of the preset field into column names to generate a feature set, which mainly comprises the following parts:
taking the file ID and the first word (firstword) as grouping objects, counting the number of occurrences of the second word (secondword) and the number of distinct second words;
taking the file ID and the first word as grouping objects, counting the maximum value, the minimum value, the median, the standard deviation and the number of distinct values of the first difference (index_diff);
taking the file ID and the second field (API_2N) as grouping objects, counting the number of occurrences of the second word (secondword).
S1033, with the file ID as the primary key, splicing the two feature sets, the global features and the local combined features, into one feature set.
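The grouping, pivoting, and splicing in steps S1032 and S1033 can be sketched in plain Python as below. Only the first of the three local feature groups is shown. The column naming scheme (`<firstword>_count`, `<firstword>_nunique`) and the choice to fill missing columns with 0 are assumptions made for this example.

```python
from collections import Counter, defaultdict


def local_features(rows):
    """Group by (file_id, firstword), then pivot firstword values into columns.

    For each pair, count the second words seen and the distinct second words.
    Returns {file_id: {column_name: value}}.
    """
    counts = defaultdict(Counter)
    distinct = defaultdict(lambda: defaultdict(set))
    for r in rows:
        counts[r["file_id"]][r["firstword"]] += 1
        distinct[r["file_id"]][r["firstword"]].add(r["secondword"])

    features = {}
    for file_id, counter in counts.items():
        row = {}
        for first, n in counter.items():
            row[f"{first}_count"] = n
            row[f"{first}_nunique"] = len(distinct[file_id][first])
        features[file_id] = row
    return features


def splice(global_feats, local_feats):
    """Join both feature sets on file_id; absent local columns become 0 (assumption)."""
    columns = sorted({c for row in local_feats.values() for c in row})
    merged = {}
    for file_id, g in global_feats.items():
        row = dict(g)
        local = local_feats.get(file_id, {})
        for c in columns:
            row[c] = local.get(c, 0)
        merged[file_id] = row
    return merged


rows = [
    {"file_id": 1, "firstword": "Create", "secondword": "File"},
    {"file_id": 1, "firstword": "Create", "secondword": "File"},
    {"file_id": 1, "firstword": "Create", "secondword": "Process"},
    {"file_id": 1, "firstword": "Reg", "secondword": "Open"},
    {"file_id": 2, "firstword": "Reg", "secondword": "Open"},
]
local = local_features(rows)
feature_set = splice({1: {"fileid_API1_count": 4}, 2: {"fileid_API1_count": 1}}, local)
```

Pivoting produces one column per observed first word, which is why merging similar API names beforehand keeps the width of this table manageable.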
The dispersion (fileid_tid_dis) = the number of distinct tid values (fileid_tid_unique) / the total number of tid occurrences (fileid_API1_count);
the coefficient of variation (fileid_tid_cv) = the standard deviation of tid (fileid_tid_std) / the mean value of tid (fileid_tid_mean);
the deviation of the median from the mean value (fileid_tid_sk) = the median of tid (fileid_tid_median) / the mean value of tid (fileid_tid_mean);
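The per-file statistics and the three ratios defined above can be computed as in the following sketch. The `summarize` helper is hypothetical, the population standard deviation is an assumption (the text does not say which form is used), and a zero mean is mapped to 0.0 to avoid division by zero.

```python
import statistics


def summarize(values):
    """Summary statistics for one numeric column of one file (e.g. its tid values)."""
    n = len(values)
    mean = statistics.fmean(values)
    median = statistics.median(values)
    std = statistics.pstdev(values)  # population form; an assumption
    unique = len(set(values))
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean,
        "median": median,
        "std": std,
        "unique": unique,
        "dis": unique / n,                    # dispersion: distinct / total
        "cv": std / mean if mean else 0.0,    # coefficient of variation
        "sk": median / mean if mean else 0.0  # median relative to mean
    }


stats = summarize([100, 100, 200, 400])
```

For the sample values the mean is 200 and the median is 150, so the median-to-mean ratio is 0.75; three distinct values out of four records give a dispersion of 0.75 as well.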
S104, model construction is carried out, and the model is corrected based on the feature engineering set and a preset threshold value:
Because the existing data set contains no records labeled 'normal', some of the files that antivirus software could not classify need to be relabeled as 'normal' records according to the initial training result of the model, after which the model is trained again so that it gains some ability to identify 'normal' files. The specific implementation process is as follows:
taking the feature set extracted in step S103 and the file category corresponding to each file ID as the input of the LightGBM multi-classification model, and, after repeated iterative learning, outputting through the model the probability of each file category for each file ID;
for each file ID whose maximum class probability is below a preset threshold (35%) and whose original file category is 'unknown', changing the file category to the pseudo label 'normal'; removing the remaining records whose file category is 'unknown' and adding the records with the pseudo label 'normal'; and mapping the file category 'normal' to the value 3, forming a new data set that serves as the input of the LightGBM multi-classification model;
and performing iterative learning for a preset number of times with the new data set as the input of the model, and saving the model after this iterative training to complete the correction of the model.
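The relabeling rule in step S104 can be sketched independently of any particular model, given the class probabilities from the first training round. The `apply_pseudo_labels` helper and the sample data are hypothetical; only the 35% threshold and the 'unknown' and 'normal' labels come from the text.

```python
def apply_pseudo_labels(records, probabilities, threshold=0.35,
                        unknown="unknown", normal="normal"):
    """Build the second-round data set.

    records:       list of (file_id, category) pairs from the first round
    probabilities: {file_id: [p_class0, p_class1, ...]} from the first model

    'unknown' records whose highest class probability is below the threshold
    are relabeled 'normal'; all other 'unknown' records are dropped.
    """
    new_records = []
    for file_id, category in records:
        if category != unknown:
            new_records.append((file_id, category))
        elif max(probabilities[file_id]) < threshold:
            new_records.append((file_id, normal))
        # remaining 'unknown' records are removed from the data set
    return new_records


records = [(1, "trojan"), (2, "unknown"), (3, "unknown")]
probabilities = {
    1: [0.90, 0.05, 0.03, 0.02],
    2: [0.30, 0.30, 0.25, 0.15],  # no confident class: becomes 'normal'
    3: [0.10, 0.10, 0.10, 0.70],  # confidently 'unknown': dropped
}
new_data = apply_pseudo_labels(records, probabilities)
```

Since the 'unknown' category no longer appears in the new data set, its numeric code 3 is free to be reused for 'normal', which matches the mapping described in the step.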
The LightGBM multi-classification model is a distributed gradient boosting model based on decision trees. Its core ideas include the histogram strategy, the leaf-wise growth strategy, and the GOSS sampling strategy. The histogram strategy discretizes continuous feature values into bins: the number of bins needed for each feature is determined, the value range is divided equally, each sample value is replaced by the value of the bin it falls into, and the result is represented as a histogram. This avoids the high cost and long running time that other gradient boosting algorithms incur when searching for the optimal split point of each feature. LightGBM adopts a leaf-wise growth strategy: at each step it finds, among all current leaves, the leaf with the largest split gain, splits it, and repeats. Compared with a level-wise growth strategy, leaf-wise growth reduces more error and achieves better accuracy for the same number of splits. The GOSS sampling strategy balances reducing the amount of data against preserving accuracy: it distinguishes instances by gradient, keeps the instances with larger gradients, and randomly samples those with smaller gradients, thereby reducing computation and improving efficiency.
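To make the GOSS (Gradient-based One-Side Sampling) idea concrete, the following is a simplified stand-alone sketch, not LightGBM's internal implementation. The function name, the parameter names, and the fixed seed are choices made for this example. The re-weighting factor (1 - a) / b for the sampled small-gradient instances follows the published description of GOSS, where a is the kept fraction and b the sampled fraction.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {sample_index: weight} for one boosting iteration.

    Keeps the top_rate fraction of samples with the largest absolute gradient
    (weight 1.0) and randomly draws an other_rate fraction of all samples from
    the remainder, up-weighted so the small-gradient group is not
    under-represented.
    """
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_k = int(n * top_rate)
    rand_k = int(n * other_rate)

    top = order[:top_k]
    rest = order[top_k:]
    sampled = random.Random(seed).sample(rest, min(rand_k, len(rest)))

    small_weight = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: small_weight for i in sampled})
    return weights


gradients = [0.1 * i for i in range(10)]
weights = goss_sample(gradients)
```

With ten samples and the default rates, the two largest-gradient samples are always kept and one of the remaining eight is drawn at random.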
And S105, detecting the model, namely detecting the collected unknown file based on the corrected model so as to determine whether the unknown file is a malicious file.
Embodiment two
Referring to fig. 2, the present invention provides a system corresponding to the API-based malicious file detection method of the first embodiment, which specifically includes the following modules:
module 101, a parameter confirmation module, configured to execute step S101 of the first embodiment, that is, to classify the collected files to confirm the file types, run the files in a sandbox, and record the parameters called when the files run; the file types include known files and unknown files; the parameters include: file ID, API name, thread ID and API call sequence number;
module 102, a preprocessing module, configured to execute step S102 of the first embodiment, that is, to perform preprocessing based on the parameters called when a known file runs, the result serving as model training data;
module 103, a feature construction module, configured to execute step S103 of the first embodiment, that is, to construct a feature engineering set based on the preprocessed data, where the feature engineering set includes: global features and local combined features;
module 104, a correction module, configured to execute step S104 of the first embodiment, that is, to construct a model and correct the model based on the feature engineering set and a preset threshold;
module 105, a detection module, configured to execute step S105 of the first embodiment, that is, to detect the collected unknown file based on the corrected model to determine whether the unknown file is a malicious file.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. An API-based malicious file detection method, characterized by comprising the following steps:
S101, classifying the collected files to confirm the file types, running the files in a sandbox, and recording the parameters called when the files run; the file types include known files and unknown files; the parameters include: file ID, API name, thread ID and API call sequence number;
S102, performing preprocessing based on the parameters called when a known file runs, the result serving as model training data;
S103, constructing a feature engineering set based on the preprocessed data, wherein the feature engineering set comprises: global features and local combined features;
S104, constructing a model, and correcting the model based on the feature engineering set and a preset threshold;
S105, detecting the collected unknown file based on the corrected model to determine whether the unknown file is a malicious file.
2. The API-based malicious file detection method according to claim 1, wherein in step S102, the preprocessing based on the parameters called when the known file runs comprises:
segmenting the API name to obtain a first word and a second word, and filling a first column and a second column corresponding to the API name based on the first word;
merging the API names based on the contents of the first column to reduce feature dimensions;
optimizing the API based on the number of files corresponding to the first column;
generating new fields based on the thread ID and the sequence number of the API call in the thread;
and converting the file type into a numerical value to complete the label-encoding mapping.
3. The API-based malicious file detection method according to claim 2, wherein the step of segmenting the API name to obtain a first word and a second word, and filling a first column and a second column corresponding to the API name based on the first word comprises:
segmenting the API name by regular-expression matching according to the upper camel case (PascalCase) naming convention of the API, to obtain the first word and the second word of the API name;
filling the first word into the first column corresponding to the API name, and filling the second word into the second column corresponding to the API name.
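As an illustration of the segmentation in claim 3, the following sketch splits an upper camel case API name with a regular expression. The pattern and the function name `split_api_name` are assumptions; the source does not disclose the exact expression used, nor how names that do not follow the convention are handled (here they are returned whole with an empty second word).

```python
import re

# Assumption: a "word" is either a run of capitals acting as an acronym
# (e.g. "WSA" in "WSAStartup") or one capital followed by lowercase
# letters or digits (e.g. "Create" in "CreateFileW").
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z0-9]*")


def split_api_name(api_name):
    """Return (first_word, second_word) of an upper camel case API name."""
    words = _WORD.findall(api_name)
    if not words:
        # Name does not follow the convention: keep it whole.
        return api_name, ""
    first = words[0]
    second = words[1] if len(words) > 1 else ""
    return first, second
```

With this pattern, `CreateFileW` yields `("Create", "File")` and `NtOpenKey` yields `("Nt", "Open")`; the two values would populate the first and second columns respectively.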
4. The API-based malicious file detection method according to claim 2, wherein the step of generating new fields based on the thread ID and the sequence number of the API call in the thread comprises:
generating a first field based on the thread ID and a difference between the sequence number of the API call and the thread ID;
taking the file ID and the thread ID respectively as grouping objects, and calculating a first difference value between two consecutive API call sequence numbers;
concatenating the contents of two adjacent first words corresponding to the same file ID to generate a second field.
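A minimal sketch of two of the derived fields in claim 4 follows: the difference between consecutive API call sequence numbers within a (file ID, thread ID) group, and the concatenation of adjacent first words within a file. The column names `file_id`, `tid`, `index` and `first_word`, the `_` separator, and the choice of 0 for the first call of a group are all assumptions. The first field of the claim is left out because its wording does not pin down a single formula.

```python
def add_sequence_fields(rows):
    """Add 'index_diff' and 'word_bigram' to each row.

    rows: list of dicts with hypothetical keys 'file_id', 'tid', 'index'
          (API call sequence number) and 'first_word', given in call order.
    """
    last_index = {}   # (file_id, tid) -> previous sequence number
    last_word = {}    # file_id -> previous first word
    out = []
    for row in rows:
        group = (row["file_id"], row["tid"])
        previous = last_index.get(group)
        # Assumption: the first call of a group gets a difference of 0.
        diff = 0 if previous is None else row["index"] - previous
        last_index[group] = row["index"]

        previous_word = last_word.get(row["file_id"])
        bigram = None
        if previous_word is not None:
            bigram = previous_word + "_" + row["first_word"]
        last_word[row["file_id"]] = row["first_word"]

        enriched = dict(row)
        enriched["index_diff"] = diff
        enriched["word_bigram"] = bigram
        out.append(enriched)
    return out
```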
5. The API-based malicious file detection method according to claim 4, wherein the step S103 of constructing a feature engineering set based on the preprocessed data comprises:
taking the file ID as a grouping object to compute the global features;
taking the file ID and a preset field as grouping objects to compute the local combined features;
and joining the global features and the local combined features into the feature engineering set by taking the thread file ID as the primary key.
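The final joining step of claim 5 amounts to a key-based join of two feature tables. The sketch below assumes that both tables are dictionaries keyed by file ID and that files missing from one table are still kept (an outer join); the source states neither of these details, and the function name is hypothetical.

```python
def merge_feature_sets(global_features, local_features):
    """Join two {file_id: {feature_name: value}} tables on the file ID.

    Assumption: an outer join, so a file present in only one table still
    gets a row containing the features that are available for it.
    """
    merged = []
    for file_id in sorted(set(global_features) | set(local_features)):
        row = {"file_id": file_id}
        row.update(global_features.get(file_id, {}))
        row.update(local_features.get(file_id, {}))
        merged.append(row)
    return merged
```

Each resulting row is one training sample: one file, with its global and local combined features side by side.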
6. The API-based malicious file detection method according to claim 5, wherein the step of computing the global features with the file ID as a grouping object comprises:
taking the file ID as a grouping object, and counting the number of distinct first words and distinct second words;
taking the file ID as a grouping object, and computing, for the thread ID, the maximum value, the minimum value, the mean value, the median, the standard deviation, the distinct count, the dispersion, the coefficient of variation, and the deviation between the median and the mean value;
taking the file ID as a grouping object, and computing, for the API call sequence number, the maximum value, the minimum value, the mean value, the median, the standard deviation, the distinct count, the dispersion, the coefficient of variation, and the deviation between the median and the mean value;
taking the file ID as a grouping object, and computing, for the first field, the maximum value, the minimum value, the mean value, the median, the standard deviation, the distinct count, the dispersion, the coefficient of variation, and the deviation between the median and the mean value;
and taking the file ID as a grouping object, and counting the number of occurrences of the second field of the API and its distinct count.
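The per-file statistics listed in claim 6 can be sketched with the standard library. Several interpretations here are assumptions: "dispersion" is taken to be the range (maximum minus minimum), the standard deviation is the population form, and the coefficient of variation is set to 0 when the mean is 0. The row key `file_id` and the function names are hypothetical.

```python
import statistics
from collections import defaultdict


def summarize(values):
    """Compute the statistics named in claim 6 for one group of values."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    std = statistics.pstdev(values)  # assumption: population std deviation
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean,
        "median": median,
        "std": std,
        "nunique": len(set(values)),
        "range": max(values) - min(values),   # assumed meaning of "dispersion"
        "cv": std / mean if mean else 0.0,    # coefficient of variation
        "median_minus_mean": median - mean,
    }


def global_features(rows, column):
    """Group rows by file ID and summarize one numeric column per file."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["file_id"]].append(row[column])
    return {file_id: summarize(vals) for file_id, vals in grouped.items()}
```

Calling `global_features(rows, "tid")`, and likewise for the sequence number and the first field, would yield the three blocks of statistics in the claim.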
7. The API-based malicious file detection method according to claim 5, wherein the step of computing the local combined features with the file ID and a preset field as grouping objects comprises:
taking the file ID and the first word as grouping objects, and counting the number of occurrences of each second word and its distinct count;
taking the file ID and the first word as grouping objects, and computing the maximum value, the minimum value, the median, the standard deviation and the distinct count of each first difference value;
and taking the file ID and the second field as grouping objects, and counting the number of occurrences of each second word.
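The first local combined feature of claim 7 can be sketched as a grouped count. The row keys `file_id`, `first_word` and `second_word` and the output names `count` and `nunique` are assumptions made for illustration; the other two items of the claim follow the same grouping pattern with a different key or statistic.

```python
from collections import defaultdict


def second_word_counts(rows):
    """Group by (file ID, first word) and count second words.

    Returns {(file_id, first_word): {"count": total occurrences,
                                     "nunique": distinct second words}}.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["file_id"], row["first_word"])].append(row["second_word"])
    return {
        key: {"count": len(words), "nunique": len(set(words))}
        for key, words in groups.items()
    }
```

For a file that calls `CreateFileW` twice and `CreateProcessW` once, the group `(file_id, "Create")` would have a count of 3 and a distinct count of 2.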
8. The API-based malicious file detection method according to claim 1, wherein the step of correcting the model based on the feature engineering set and the preset threshold in step S104 comprises:
taking the feature engineering set and the file types corresponding to the files as the input of the model to perform iterative learning, and outputting the probability of each file type for each file ID;
changing the file type to the pseudo label 'normal' for each file whose maximum probability value is smaller than the preset threshold and whose original file type is 'unknown', to form a new data set;
and performing iterative learning for a preset number of times with the new data set as the input of the model to complete the correction of the model.
9. The API-based malicious file detection method according to claim 1, wherein in step S101, the step of classifying the collected files comprises: scanning the collected files with antivirus software, and confirming the file types according to the scanning results.
10. An API-based malicious file detection system, comprising:
the parameter confirmation module is used for classifying the collected files to confirm the file types, putting the files into a sandbox for operation, and recording parameters called when the files operate; the file category comprises known files and unknown files; the parameters include: file ID, API name, thread ID and API call sequence number;
the preprocessing module is used for preprocessing parameters called during the operation of the known file to serve as model training data;
the feature construction module is used for constructing a feature engineering set based on the preprocessed data, and the feature engineering set comprises: global features and local combined features;
the correction module is used for constructing a model and correcting the model based on the feature engineering set and a preset threshold;
and the detection module is used for detecting the collected unknown file based on the corrected model to determine whether the unknown file is a malicious file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110749396.XA CN113378156B (en) | 2021-07-01 | 2021-07-01 | API-based malicious file detection method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113378156A (en) | 2021-09-10 |
CN113378156B (en) | 2023-07-11 |
Family
ID=77580639
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110749396.XA (Active, CN113378156B (en)) | API-based malicious file detection method and system | 2021-07-01 | 2021-07-01 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113378156B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN113378156B (en) | 2023-07-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||