CN113378156A - Malicious file detection method and system based on API - Google Patents

Publication number
CN113378156A
Authority
CN
China
Prior art keywords
file
api
word
files
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110749396.XA
Other languages
Chinese (zh)
Other versions
CN113378156B (en)
Inventor
梁淑云
殷钱安
余贤喆
王启凡
陶景龙
徐�明
刘胜
马影
周晓勇
魏国富
夏玉明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd
Priority to CN202110749396.XA
Publication of CN113378156A
Application granted
Publication of CN113378156B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/52 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity; Preventing unwanted data erasure; Buffer overflow
    • G06F21/53 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems during program execution, e.g. stack integrity; Preventing unwanted data erasure; Buffer overflow by executing in a restricted environment, e.g. sandbox or secure virtual machine
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/448 Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4488 Object-oriented
    • G06F9/449 Object-oriented method invocation or resolution
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a malicious file detection method and system based on an API (application program interface). The method comprises the following steps: running a file in a sandbox while recording the name of each API called during execution, the thread ID (tid), and the sequence number (index) of each API call within its thread; preprocessing the data, including processing the API names, optimizing low-frequency APIs, generating new fields, and mapping labels to numerical codes; constructing features from the processed data, comprising global features and local combined features, which are finally spliced into a single feature set; according to the initial training result of the model, relabeling some of the files that the antivirus software could not classify as "normal" records, and then training the model again; and performing model prediction. The invention also provides a malicious file detection system based on the API. The method achieves a certain recognition rate for malicious files that evade feature-code (signature) detection and sandbox detection, and can improve the generalization capability of malicious file detection.

Description

Malicious file detection method and system based on API
Technical Field
The invention relates to the technical field of information security services, in particular to a malicious file detection method and system based on an API.
Background
In recent years, with the development of computer technology, intelligent terminals and network technology have been widely applied, and malicious files have spread and mutated accordingly. Legitimate files strengthen and extend the capabilities of a computer and make people's work and life easier; malicious files are intended to steal or destroy computer data, which can cause economic loss and distress to enterprises and individuals. Detecting malicious files in time, blocking the threats they bring, and keeping the network environment healthy and safe are therefore increasingly important.
Current malicious file detection methods mainly include the feature code (signature) method and sandbox detection. The feature code method is the most common: a virus feature code library is established and maintained, and a file is judged malicious if it is found to contain a known feature code. This method is fast, but it cannot detect malicious files whose feature codes are unknown, and a malicious file that has been transformed, encrypted, or packed can evade the feature codes. In recent years sandbox technology has been used more and more widely: it runs an unknown file in a simulated normal environment, records the file's runtime actions, and matches those actions against a malicious file library to judge whether the unknown file is malicious. With the spread of machine learning, methods that build machine learning models to detect malicious files have also appeared.
For example, patent document No. 202010572487.6 discloses a method for constructing a malicious file detection model and detecting malicious files: obtaining a plurality of normal samples and a plurality of malicious samples and labeling them respectively; filtering out, from the malicious samples, those that are not packed; establishing a static model, comprising: obtaining the PE formats of the normal samples and the malicious samples; converting the obtained PE format of each sample into a plurality of feature vectors; combining the feature vectors and associating them with labels; tuning the random forest model and the LightGBM model to optimal parameters; inputting the labeled feature vectors into the random forest model and the LightGBM model, and establishing a random forest model and a LightGBM model for statically detecting malicious files; establishing a dynamic model, comprising: putting the normal samples and the malicious samples into a sandbox to obtain a sandbox report, and acquiring from the report the feature vectors of the samples, wherein the feature vectors relate to the api, tid (thread ID), return_value and index; combining the feature vectors and associating them with labels; tuning the random forest model and the LightGBM model to optimal parameters, and establishing an important-feature random forest model; inputting the labeled feature vectors into the random forest model, the important-feature random forest model and the LightGBM model, and establishing a random forest model, an important-feature random forest model and a LightGBM model for dynamically detecting malicious files; fusing all the static models and all the dynamic models to obtain a fusion model; and calculating a final malicious score from the total malicious suspicion score obtained by the fusion model and the malicious suspicion score obtained by the malheur model, and detecting the sample according to the final malicious score.
Although sandbox detection overcomes, to a certain extent, the inability of the feature code method to detect unknown malicious files, attackers keep looking for ways to bypass sandbox detection, such as probing system characteristics or delaying operations, so that the sandbox cannot judge whether some unknown files are malicious.
Although existing machine learning methods solve, to a certain extent, the problem of sandbox detection being bypassed, their feature engineering is biased toward statistical features: it ignores the differences between files in the time sequence of their API calls and ignores feature sparsity during feature processing, which may reduce the accuracy and efficiency of the model.
Disclosure of Invention
The technical problem to be solved by the invention is how to judge whether the unknown file belongs to a malicious file.
The invention solves the technical problems through the following technical means: a malicious file detection method based on API comprises the following steps:
S101, classifying the collected files to confirm their file categories, running the files in a sandbox, and recording the parameters called while the files run; the file categories comprise known files and unknown files; the parameters include: file ID, API name, thread ID and API call sequence number;
S102, preprocessing the parameters called while the known files run, to serve as model training data;
S103, constructing a feature engineering set based on the preprocessed data, wherein the feature engineering set comprises: global features and local combined features;
S104, constructing a model, and correcting the model based on the feature engineering set and a preset threshold;
S105, detecting the collected unknown files based on the corrected model, so as to confirm whether each unknown file is a malicious file.
The method focuses on processing the API names: keywords are extracted from the API names and used to construct features, which reduces feature dimensionality and feature sparsity and improves the efficiency and accuracy of the model. Moreover, model detection outputs the probability that an unknown file belongs to each category, so the likelihood that the file is malicious is quantified as a score. In addition, when no file labeled "normal" is available, the model is used to generate a pseudo-labeled "normal" data set, and a multi-classification model with some ability to recognize "normal" files is trained, so that both whether an unknown file is malicious and which category it belongs to can be predicted.
As an optimized technical solution, in the step S102, the step of preprocessing based on the parameter called by the known file runtime includes:
segmenting the API name to obtain a first word and a second word, and filling a first column and a second column corresponding to the API name based on the first word;
merging the API names based on the contents of the first column to reduce feature dimensions;
optimizing the API based on the number of the files corresponding to the first column;
generating a new field based on the thread ID and the sequence number of the API call in the thread;
and converting the file type into a numerical value to complete the label encoding mapping.
As an optimized technical solution, the step of segmenting the API name to obtain a first word and a second word, and filling a first column and a second column corresponding to the API name based on the first word, includes:
segmenting the API name by regular-expression matching, according to the upper camel case (large hump) naming rule of the API, to obtain the first word and the second word of the API name;
populating the first word in a first column corresponding to the API name and the second word in a second column corresponding to the API name.
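The segmentation step can be illustrated with a minimal sketch. It assumes that a "word" is a capital letter followed by lowercase letters or digits, and that a name with no such word is kept whole as the first word with an empty second word; the function name split_api_name and the exact pattern are illustrative choices, not taken from the patent.

```python
import re

# Assumed pattern: one upper-camel-case word is a capital letter
# followed by at least one lowercase letter or digit.
_WORD = re.compile(r"[A-Z][a-z0-9]+")

def split_api_name(api_name):
    """Return (first_word, second_word) for an API name (hypothetical helper)."""
    words = _WORD.findall(api_name)
    if not words:
        # No camel-case word found (e.g. an all-lowercase name):
        # keep the whole name as the first word, leave the second empty.
        return api_name, ""
    first = words[0]
    second = words[1] if len(words) > 1 else ""
    return first, second

print(split_api_name("CreateFileW"))
print(split_api_name("socket"))
```

Under this pattern a trailing single capital such as the "W" or "A" suffix is not matched as a word, so variants that differ only in that suffix share the same first and second word.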
As an optimized technical solution, the step of generating a new field based on the thread ID and the sequence number of the API call in the thread includes:
generating a first field from the difference between the sequence number of the API call and the thread ID;
taking the file name and the thread ID as grouping objects, calculating a first difference value between the sequence numbers of two consecutive API calls;
concatenating the contents of two adjacent first words corresponding to the same thread ID to generate a second field.
The step S103 of constructing a feature engineering set based on the preprocessed data includes:
taking the thread ID as a grouping object to count the global features;
taking a thread ID and a preset field as grouping objects to count the local combination characteristics;
and splicing the global features and the local combined features into the feature engineering set by taking the thread ID as a main key.
As an optimized technical solution, the step of taking the thread ID as a grouping object to count the global features includes:
taking the thread ID as a grouping object, counting the number of occurrences of the first word and the number of distinct first words, and counting the number of distinct second words;
taking the thread ID as a grouping object, counting the maximum value, minimum value, mean, median, standard deviation, number of distinct values, dispersion, coefficient of variation, and deviation of the median from the mean of the thread ID;
taking the thread ID as a grouping object, counting the maximum value, minimum value, mean, median, standard deviation, number of distinct values, dispersion, coefficient of variation, and deviation of the median from the mean of the sequence number of the API call;
taking the thread ID as a grouping object, counting the maximum value, minimum value, mean, median, standard deviation, number of distinct values, dispersion, coefficient of variation, and deviation of the median from the mean of the first field;
and taking the thread ID as a grouping object, counting the number of occurrences of the second field and the number of distinct values of the second field.
As an optimized technical scheme, the step of taking the thread ID and the preset field as grouping objects to count the local combination features comprises the following steps:
taking the thread ID and the first word as grouping objects, counting the number of occurrences of each second word and the number of distinct second words;
taking the thread ID and the first word as grouping objects, counting the maximum value, minimum value, median, standard deviation, and number of distinct values of the first difference value;
and taking the thread ID and the second field as grouping objects, counting the number of occurrences of each second word.
As an optimized technical solution, in the step S104, the step of modifying the model based on the feature engineering set and the preset threshold includes:
taking the feature engineering set and the file category corresponding to each file as the input of the model for iterative learning, and outputting the probability of each file category for each file ID;
relabeling as the pseudo-label "normal" each file whose original file category is "unknown" and whose maximum probability value is smaller than a preset threshold, to form a new data set;
and performing a preset number of iterations of learning with the new data set as the input of the model, to complete the correction of the model.
As an optimized technical solution, in step S101, the step of classifying the collected files includes: and scanning the collected files through antivirus software, and confirming the file types according to the scanning results.
The invention also provides a malicious file detection system based on the API, which comprises:
the parameter confirmation module is used for classifying the collected files to confirm the file types, putting the files into a sandbox for operation, and recording parameters called when the files operate; the file category comprises known files and unknown files; the parameters include: file ID, API name, thread ID and API call sequence number;
the preprocessing module is used for preprocessing parameters called during the operation of the known file to serve as model training data;
the feature construction module is used for constructing a feature engineering set based on the preprocessed data, and the feature engineering set comprises: global features and local combined features;
the correction module is used for constructing a model and correcting the model based on the characteristic engineering set and a preset threshold;
and the detection module is used for detecting the acquired unknown file based on the corrected model so as to confirm whether the unknown file is a malicious file.
The invention has the following advantages: it provides a malicious file detection method based on an API (application program interface), in which features are constructed from the APIs and thread IDs (tid) called while a file runs, a classification model is trained, and the model judges whether an unknown file is a malicious file.
The method also focuses on processing the API names: keywords are extracted from the API names and used to construct features, which reduces feature dimensionality and feature sparsity and improves the efficiency and accuracy of the model. Moreover, model prediction outputs the probability that an unknown file belongs to each category, so the likelihood that the file is malicious is quantified as a score. In addition, when no file labeled "normal" is available, the model is used to generate a pseudo-labeled "normal" data set, and a multi-classification model with some ability to recognize "normal" files is trained, so that both whether an unknown file is malicious and which category it belongs to can be predicted.
Drawings
Fig. 1 is a general flowchart of a malicious file detection method based on API in embodiment 1 of the present invention.
Fig. 2 is a block diagram of a malicious file detection system based on API in embodiment 2 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As described in the background, existing malicious file detection methods all have shortcomings to some extent. Starting from practical needs, the method addresses these shortcomings by constructing 2-gram combinations of API calls, splitting and merging API names, and merging low-frequency APIs, which avoids feature sparsity, reduces feature dimensions, and extends the API features along the call time sequence, thereby improving the accuracy and efficiency of the model. On the other hand, real environments contain many unlabeled samples but only a limited number of labeled ones. The method also handles the case where there are many unlabeled files but no files labeled "normal", in which a classification model able to predict the "normal" category could not otherwise be built: the model itself is used to generate "normal" files through pseudo-labeling.
Embodiment 1
Referring to fig. 1, the present invention provides a malicious file detection method based on API, which specifically includes the following steps:
S101, classifying the collected files to confirm their file categories, running the files in a sandbox, and recording the parameters called while the files run;
the file categories comprise known files and unknown files;
the parameters include: the API (application program interface) name, the thread ID (tid), and the sequence number (index) of the API call within its thread. While a file runs, it generally calls multiple APIs across multiple threads. There is no ordering relationship between different tids, whereas within one tid the indexes, from small to large, give the order in which the calls were made, although they may not be consecutive.
The step of classifying the collected files comprises: scanning the collected files with antivirus software, and confirming the file categories according to the scanning results.
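As an illustration of the recorded parameters, the sketch below models one sandbox record and restores the call order inside each thread. The ApiCall type, its field names, and the sample values are assumptions made for this example; the patent does not prescribe a record format.

```python
from collections import defaultdict
from typing import NamedTuple

class ApiCall(NamedTuple):
    """One recorded API call (hypothetical record layout)."""
    file_id: int
    api: str
    tid: int
    index: int

def calls_by_thread(records):
    """Group records per (file_id, tid) and order each group by index.

    Indexes only define an order inside one thread, so no ordering is
    imposed between different threads.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[(rec.file_id, rec.tid)].append(rec)
    for calls in groups.values():
        calls.sort(key=lambda r: r.index)
    return dict(groups)

records = [
    ApiCall(1, "CreateFileW", 2300, 5),
    ApiCall(1, "ReadFile", 2300, 2),   # earlier call, non-consecutive index
    ApiCall(1, "socket", 2412, 0),
]
grouped = calls_by_thread(records)
print([r.api for r in grouped[(1, 2300)]])
```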
S102, preprocessing is carried out based on parameters called when a known file runs to serve as model training data;
the step of preprocessing based on parameters called by the known file runtime comprises:
S1021, segmenting the API name to obtain a first word and a second word, and filling a first column and a second column corresponding to the API name:
S10211, according to the upper camel case (large hump) naming rule of the API, segmenting the API name by regular-expression matching to obtain its first word and second word; for example, after "CreateFileW" is segmented, the first word is "Create" and the second word is "File";
S10212, populating the first word in a first column corresponding to the API name and the second word in a second column corresponding to the API name.
S1022, merging the API names based on the content of the first word to reduce feature dimensions;
when an API name contains no capital letters, the API name itself is used as the first word and the second word is left empty, which completes the first-word and second-word columns. This processing merges APIs that have the same or similar functions, such as "CreateFileW" and "CreateFileA", thereby reducing feature dimensions.
S1023, optimizing low-frequency APIs, that is, optimizing the APIs based on the number of files corresponding to each value in the first column;
S1024, generating new fields based on the thread ID and the sequence number of the API call within its thread, including:
generating a first field from the difference between the sequence number (index) of the API call and the thread ID (tid);
taking the file ID (file_id) and the thread ID as grouping objects, calculating a first difference value between the sequence numbers of two consecutive API calls;
concatenating the contents of two adjacent first words corresponding to the same file ID to generate a second field.
S1025, converting the file category into a numerical value to complete the label encoding mapping.
For example, "trojan" is mapped to a value of 0, "worm virus" is mapped to a value of 1, "malicious web file" is mapped to a value of 2, "unknown" is mapped to a value of 3, and so on.
The label (file category) includes but is not limited to the following categories: trojan, worm virus, macro virus document, downloader, virus program, malicious web file, suspicious program, backdoor program, game/smile file, unknown and the like, wherein the unknown means that the antivirus software cannot judge whether the file is a malicious file or not, but does not represent that the file is a normal file.
The upper camel case (large hump) naming convention means that a variable name or function name is formed by joining one or more words, with the first letter of each word capitalized, as in "CreateFileW".
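The new fields of step S1024 can be sketched as follows. This is a minimal illustration under several assumptions: the first field is taken to be index minus tid, rows are ordered by tid and then index inside each file, the first call of a group gets no difference or 2-gram, and the two first words are joined with an underscore. The helper name add_fields and the key inx_tid are illustrative only.

```python
def add_fields(rows):
    """Add inx_tid, index_diff and API_2N to each row (hypothetical helper).

    rows: list of dicts with keys file_id, tid, index, firstword.
    Returns a new list sorted by (file_id, tid, index).
    """
    ordered = sorted(rows, key=lambda r: (r["file_id"], r["tid"], r["index"]))
    out = []
    prev = None
    for row in ordered:
        new = dict(row)
        # First field: assumed here to be the call index minus the thread ID.
        new["inx_tid"] = row["index"] - row["tid"]
        same_file = prev is not None and prev["file_id"] == row["file_id"]
        same_thread = same_file and prev["tid"] == row["tid"]
        # First difference: gap to the previous call of the same file and thread.
        new["index_diff"] = row["index"] - prev["index"] if same_thread else None
        # Second field: 2-gram of adjacent first words within the same file.
        new["API_2N"] = (
            prev["firstword"] + "_" + row["firstword"] if same_file else None
        )
        out.append(new)
        prev = row
    return out

rows = [
    {"file_id": 1, "tid": 10, "index": 4, "firstword": "Read"},
    {"file_id": 1, "tid": 10, "index": 1, "firstword": "Create"},
    {"file_id": 2, "tid": 7, "index": 0, "firstword": "Reg"},
]
for r in add_fields(rows):
    print(r)
```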
S103, feature engineering: features are constructed from the data processed in step S102 and consist of two parts, global features and local combined features, as follows:
S1031, taking the file ID (file_id) as a grouping object to count the global features, which mainly comprise the following parts:
taking the file ID as a grouping object, counting the number of occurrences of the first word (firstword) (fileid_API1_count) and the number of distinct first words (fileid_API1_nunique), and counting the number of distinct second words (secondword) (fileid_API2_nunique);
taking the file ID as a grouping object, counting, for the thread ID (tid), the maximum value (fileid_tid_max), the minimum value (fileid_tid_min), the mean (fileid_tid_mean), the median (fileid_tid_median), the standard deviation (fileid_tid_std), the number of distinct values (fileid_tid_unique), the dispersion (fileid_tid_dis), the coefficient of variation (fileid_tid_cv) and the deviation of the median from the mean (fileid_tid_sk);
taking the file ID as a grouping object, counting, for the sequence number (index) of the API call, the maximum value (fileid_index_max), the minimum value (fileid_index_min), the mean (fileid_index_mean), the median (fileid_index_median), the standard deviation (fileid_index_std), the number of distinct values (fileid_index_unique), the dispersion (fileid_index_dis), the coefficient of variation (fileid_index_cv) and the deviation of the median from the mean (fileid_index_sk);
taking the file ID as a grouping object, counting, for the first field, the maximum value (file_inx_tid_max), the minimum value (file_inx_tid_min), the mean (file_inx_tid_mean), the median (file_inx_tid_median), the standard deviation (file_inx_tid_std), the number of distinct values (file_inx_tid_unique), the dispersion (file_inx_tid_dis), the coefficient of variation (file_inx_tid_cv) and the deviation of the median from the mean (file_inx_tid_sk);
taking the file ID as a grouping object, counting the number of occurrences of the second field (API_2N) (file_API_2N_count) and the number of distinct values of the second field (file_API_2N_nunique).
S1032, taking the combination of the file ID (file_id) and a preset field as grouping objects to count the local combined features, and then, with the file ID as the primary key, unfolding and transposing the values of the preset field into column names to generate a feature set, which mainly comprises the following parts:
taking the file ID and the first word (firstword) as grouping objects, counting the number of occurrences of each second word (secondword) and the number of distinct second words;
taking the file ID and the first word as grouping objects, counting the maximum value, minimum value, median, standard deviation, and number of distinct values of the first difference value (index_diff);
and taking the file ID and the second field (API_2N) as grouping objects, counting the number of occurrences of each second word (secondword).
S1033, with the file ID as the primary key, splicing the global features and the local combined features into a single feature set.
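The grouping, unfolding, and splicing of steps S1032 and S1033 can be illustrated with a small sketch. It covers only the second-word counts per file ID and first word, and it assumes a column naming scheme of the form firstword_secondword_count and a fill value of 0 for combinations a file never produced; both are choices made for this example rather than details given in the patent.

```python
from collections import Counter, defaultdict

def local_counts(rows):
    """Count second words per (file_id, firstword), one wide row per file."""
    wide = defaultdict(Counter)
    for r in rows:
        # Hypothetical column name built from the grouping field values.
        column = f"{r['firstword']}_{r['secondword']}_count"
        wide[r["file_id"]][column] += 1
    return {fid: dict(counts) for fid, counts in wide.items()}

def splice(global_feats, local_feats):
    """Join two {file_id: {feature: value}} tables on file_id.

    Local columns that a file never produced are filled with 0.
    """
    columns = sorted({c for feats in local_feats.values() for c in feats})
    merged = {}
    for fid, feats in global_feats.items():
        row = dict(feats)
        local = local_feats.get(fid, {})
        for c in columns:
            row[c] = local.get(c, 0)
        merged[fid] = row
    return merged

rows = [
    {"file_id": 1, "firstword": "Create", "secondword": "File"},
    {"file_id": 1, "firstword": "Create", "secondword": "File"},
    {"file_id": 1, "firstword": "Reg", "secondword": "Open"},
    {"file_id": 2, "firstword": "Create", "secondword": "Thread"},
]
global_feats = {1: {"fileid_API1_count": 3}, 2: {"fileid_API1_count": 1}}
print(splice(global_feats, local_counts(rows)))
```

Filling absent combinations with 0 keeps every file on the same column set, which a tabular model needs as input.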
The derived statistics are defined as follows, using tid as an example:
dispersion: fileid_tid_dis = fileid_tid_unique / fileid_API1_count, that is, the number of distinct tid values divided by the total number of records;
coefficient of variation: fileid_tid_cv = fileid_tid_std / fileid_tid_mean, that is, the standard deviation of tid divided by the mean of tid;
deviation of the median from the mean: fileid_tid_sk = fileid_tid_median / fileid_tid_mean, that is, the median of tid divided by the mean of tid;
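A short sketch of these statistics for one file follows. It assumes the sample standard deviation (as a dataframe library would typically compute by default), returns 0.0 for the two ratios when the mean is zero, and uses hypothetical key suffixes; none of these details is fixed by the text above.

```python
import statistics

def global_stats(values, prefix):
    """Per-file statistics of one numeric field (hypothetical helper).

    values: the field's values over all records of a single file.
    """
    n = len(values)
    mean = statistics.fmean(values)
    median = statistics.median(values)
    # Assumption: sample standard deviation; 0.0 when only one record exists.
    std = statistics.stdev(values) if n > 1 else 0.0
    nunique = len(set(values))
    return {
        f"{prefix}_max": max(values),
        f"{prefix}_min": min(values),
        f"{prefix}_mean": mean,
        f"{prefix}_median": median,
        f"{prefix}_std": std,
        f"{prefix}_nunique": nunique,
        # dispersion = distinct values / total number of records
        f"{prefix}_dis": nunique / n,
        # coefficient of variation = standard deviation / mean
        f"{prefix}_cv": std / mean if mean else 0.0,
        # deviation of the median from the mean = median / mean
        f"{prefix}_sk": median / mean if mean else 0.0,
    }

print(global_stats([2, 2, 4, 8], "fileid_tid"))
```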
S104: model construction. The model is built and then corrected based on the feature engineering set and a preset threshold:
Because the existing data set contains no records labeled 'normal', some of the files that the antivirus software could not judge must be relabeled as 'normal' according to the initial training result of the model, and the model is then trained again so that it gains some ability to identify 'normal' files. The specific implementation process is as follows:
taking the feature set extracted in step S103 and the file category corresponding to each file ID as the input of the LightGBM multi-classification model, and, after repeated iterative learning, outputting through the model the probability of each file category for each file ID;
for each file ID whose maximum probability value is smaller than a preset threshold (35%) and whose original file category is 'unknown', changing the file category to the pseudo-label 'normal'; removing the remaining records whose file category is 'unknown', adding the records with the pseudo-label 'normal', and mapping the file category 'normal' to the numerical value 3, to form a new data set that is used as the input of the LightGBM multi-classification model;
and performing iterative learning for a preset number of iterations with the new data set as the input of the model, and saving the model after this iterative training to complete the correction of the model.
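The relabeling step between the two training rounds can be sketched as below. The model itself is left out: `probabilities` stands in for the per-class output of the first training round, and the record layout, the string marker `"unknown"`, and the integer codes used for the known categories are hypothetical. Only the 35% threshold and the value 3 for 'normal' come from the text.

```python
UNKNOWN = "unknown"  # category of files the antivirus software could not judge
NORMAL_CODE = 3      # numeric code for the pseudo-label 'normal' (from the text)
THRESHOLD = 0.35     # preset threshold (from the text)


def apply_pseudo_labels(records, probabilities, threshold=THRESHOLD):
    """Build the data set for the second training round.

    records:       list of (file_id, label); label is an int code or UNKNOWN
    probabilities: {file_id: [p_class0, p_class1, ...]} from the first round
    """
    new_dataset = []
    for file_id, label in records:
        if label != UNKNOWN:
            # Known categories are kept unchanged.
            new_dataset.append((file_id, label))
        elif max(probabilities[file_id]) < threshold:
            # The model is not confident in any category: treat as 'normal'.
            new_dataset.append((file_id, NORMAL_CODE))
        # Remaining 'unknown' records are removed from the data set.
    return new_dataset


# Hypothetical example: labels 0..2 are known categories.
records = [(1, 0), (2, UNKNOWN), (3, UNKNOWN), (4, 2)]
probabilities = {
    1: [0.90, 0.05, 0.05],
    2: [0.34, 0.33, 0.33],  # max below 35%, becomes 'normal'
    3: [0.80, 0.10, 0.10],  # confident prediction, record is dropped
    4: [0.10, 0.10, 0.80],
}
new_dataset = apply_pseudo_labels(records, probabilities)
```

The resulting list would then be joined back to the feature set by file ID and passed to the second training round.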
The LightGBM multi-classification model is a distributed gradient boosting model based on decision trees. Its core ideas include the histogram strategy, the leaf-wise growth strategy, and the GOSS (Gradient-based One-Side Sampling) strategy. The histogram strategy converts continuous feature values into bins through discretization: the number of bins needed for each feature is determined, the value range is divided equally, each sample value is replaced by the value of the bin it falls into, and the result is represented as a histogram. This addresses the high cost and long running time that other gradient boosting algorithms incur when searching for the optimal split point of each feature. LightGBM also adopts a leaf-wise growth strategy: each time, it finds the leaf with the largest split gain among all current leaves and splits it, then repeats the process. Compared with a level-wise growth strategy, leaf-wise growth reduces more error and achieves better accuracy for the same number of splits. The GOSS sampling strategy balances reducing the amount of data against preserving accuracy: it distinguishes instances by gradient magnitude, keeps the instances with larger gradients, and randomly samples those with smaller gradients, which reduces the amount of computation and improves efficiency.
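To make the histogram and GOSS ideas concrete, the following standalone sketch implements equal-width binning and one-side sampling in plain Python. It is not LightGBM's internal code: the function names, the `seed` parameter, and the rule that sampled small-gradient instances are re-weighted by (1 - top_rate) / other_rate follow the published GOSS description and are assumptions with respect to the patent text.

```python
import random


def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def goss_sample(gradients, top_rate, other_rate, seed=0):
    """Return a list of (index, weight) pairs.

    Keeps the top_rate fraction of instances with the largest |gradient|
    (weight 1.0) and randomly samples an other_rate fraction of all
    instances from the rest, re-weighted so the gradient sum stays
    approximately unbiased.
    """
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))
    weight = (1 - top_rate) / other_rate
    return [(i, 1.0) for i in top] + [(i, weight) for i in sampled]


bins = equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], n_bins=4)
grads = [0.1, -0.9, 0.05, 0.8, -0.02, 0.3, 0.01, -0.6, 0.2, 0.04]
sample = goss_sample(grads, top_rate=0.2, other_rate=0.2)
```

With ten gradients and both rates set to 0.2, the two largest-magnitude instances are always kept and two of the remaining eight are drawn at random, so each tree is fitted on four weighted instances instead of ten.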
S105: model detection. The collected unknown files are detected based on the corrected model to determine whether each unknown file is a malicious file.
Embodiment two
Referring to fig. 2, the present invention provides a system corresponding to the API-based malicious file detection method of the first embodiment, which specifically includes the following modules:
module 101, a parameter confirmation module, configured to perform step S101 of the first embodiment, that is, to classify the collected files to confirm the file categories, run the files in a sandbox, and record the parameters called when the files run; the file categories include known files and unknown files; the parameters include: file ID, API name, thread ID, and API call sequence number;
module 102, a preprocessing module, configured to perform step S102 of the first embodiment, that is, to preprocess the parameters called when the known files run, to serve as model training data;
module 103, a feature construction module, configured to perform step S103 of the first embodiment, that is, to construct a feature engineering set based on the preprocessed data, where the feature engineering set includes global features and local combination features;
module 104, a correction module, configured to perform step S104 of the first embodiment, that is, to construct a model and correct the model based on the feature engineering set and a preset threshold;
module 105, a detection module, configured to perform step S105 of the first embodiment, that is, to detect the collected unknown files based on the corrected model to determine whether each unknown file is a malicious file.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A malicious file detection method based on API is characterized by comprising the following steps:
S101, classifying collected files to confirm the file categories, running the files in a sandbox, and recording the parameters called when the files run; the file categories comprise known files and unknown files; the parameters include: file ID, API name, thread ID, and API call sequence number;
S102, preprocessing the parameters called when the known files run, to serve as model training data;
S103, constructing a feature engineering set based on the preprocessed data, wherein the feature engineering set comprises: global features and local combination features;
S104, constructing a model, and correcting the model based on the feature engineering set and a preset threshold;
and S105, detecting the collected unknown files based on the corrected model to determine whether each unknown file is a malicious file.
2. The API-based malicious file detection method according to claim 1, wherein in step S102, the preprocessing of the parameters called when the known files run includes:
segmenting the API name to obtain a first word and a second word, and filling a first column and a second column corresponding to the API name based on the first word;
merging the API names based on the contents of the first column to reduce feature dimensions;
optimizing the API based on the number of the files corresponding to the first column;
generating a new field based on the thread ID and the sequence number of the API call in the thread;
and converting the file type into a numerical value, and finishing label coding mapping.
3. The API-based malicious file detection method according to claim 2, wherein the API name is segmented to obtain a first word and a second word, and the step of populating a first column and a second column corresponding to the API name based on the first word comprises:
segmenting the API name by regular expression matching according to the upper camel case naming convention of the API, to obtain a first word and a second word of the API name;
the first word is populated in a first column corresponding to the API name and the second word is populated in a second column corresponding to the API name.
4. The API-based malicious file detection method according to claim 2, wherein the step of generating new fields based on the thread ID and the sequence number of API calls in the thread comprises:
generating a first field based on the thread ID and a difference between the order number of the API call and the thread ID;
respectively taking the file ID and the thread ID as grouping objects, and calculating a first difference value between two consecutive API call sequence numbers;
the contents of two adjacent first words corresponding to the same file ID are concatenated to generate a second field.
5. The API-based malicious file detection method according to claim 4, wherein the step S103 of constructing a feature engineering set based on the preprocessed data includes:
taking the file ID as a grouping object to count the global features;
taking the file ID and a preset field as grouping objects to count the local combination characteristics;
and joining the global features and the local combination features into the feature engineering set by taking the file ID as the primary key.
6. The API-based malicious file detection method according to claim 5, wherein the step of counting the global characteristics with file IDs as grouping objects comprises:
taking the file ID as a grouping object, and counting the deduplicated number of first words and of second words;
taking the file ID as a grouping object, and computing the maximum value, the minimum value, the mean value, the median, the standard deviation, the deduplicated count, the dispersion, the coefficient of variation, and the deviation of the median from the mean value of the thread ID;
taking the file ID as a grouping object, and computing the maximum value, the minimum value, the mean value, the median, the standard deviation, the deduplicated count, the dispersion, the coefficient of variation, and the deviation of the median from the mean value of the API call sequence number;
taking the file ID as a grouping object, and computing the maximum value, the minimum value, the mean value, the median, the standard deviation, the deduplicated count, the dispersion, the coefficient of variation, and the deviation of the median from the mean value of the first field;
and taking the file ID as a grouping object, and counting the number of occurrences of the second field of the API and the number of occurrences after deduplication.
7. The API-based malicious file detection method according to claim 5, wherein the step of counting the local combined features with a file ID and a preset field as grouping objects comprises:
taking the file ID and the first word as grouping objects, and counting the number of occurrences of the second words and the number of occurrences after deduplication;
taking the file ID and the first word as grouping objects, and computing the maximum value, the minimum value, the median, the standard deviation, and the deduplicated count of the first difference value;
and taking the file ID and the second field as grouping objects, and counting the number of occurrences of each second word.
8. The API-based malicious file detection method according to claim 1, wherein the step of modifying the model based on the feature engineering set and the preset threshold in step S104 comprises:
taking the characteristic engineering set and the file types corresponding to the files as the input of a model to carry out iterative learning, and outputting the probability of the file type corresponding to each file ID;
changing the file category of each file whose maximum probability value is smaller than a preset threshold and whose original file category is unknown to the pseudo-label 'normal', to form a new data set;
and performing iterative learning for preset times by taking the new data set as the input of the model to finish the correction of the model.
9. The API-based malicious file detection method according to claim 1, wherein in step S101, the step of classifying the collected files comprises: scanning the collected files with antivirus software, and confirming the file categories according to the scanning results.
10. An API-based malicious file detection system, comprising:
the parameter confirmation module is used for classifying the collected files to confirm the file types, putting the files into a sandbox for operation, and recording parameters called when the files operate; the file category comprises known files and unknown files; the parameters include: file ID, API name, thread ID and API call sequence number;
the preprocessing module is used for preprocessing parameters called during the operation of the known file to serve as model training data;
the feature construction module is used for constructing a feature engineering set based on the preprocessed data, and the feature engineering set comprises: global features and local combined features;
the correction module is used for constructing a model and correcting the model based on the characteristic engineering set and a preset threshold;
and the detection module is used for detecting the acquired unknown file based on the corrected model so as to confirm whether the unknown file is a malicious file.
CN202110749396.XA 2021-07-01 2021-07-01 API-based malicious file detection method and system Active CN113378156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110749396.XA CN113378156B (en) 2021-07-01 2021-07-01 API-based malicious file detection method and system

Publications (2)

Publication Number Publication Date
CN113378156A true CN113378156A (en) 2021-09-10
CN113378156B CN113378156B (en) 2023-07-11

Family

ID=77580639

Country Status (1)

Country Link
CN (1) CN113378156B (en)

Also Published As

Publication number Publication date
CN113378156B (en) 2023-07-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant