CN111881447B - Intelligent evidence obtaining method and system for malicious code fragments

Info

Publication number: CN111881447B
Application number: CN202010594720.0A
Authority: CN
Inventors: 李炳龙; 张宇; 李媛芳; 佟金龙; 孙怡峰; 常朝稳; 王清贤
Original assignee: Henan Yunyan Technology Co ltd; Information Engineering University of PLA Strategic Support Force
Current assignee: Henan Yunyan Technology Co ltd; Information Engineering University of PLA Strategic Support Force
Priority date: 2020-06-28
Filing date: 2020-06-28
Publication date: 2022-12-06
Anticipated expiration: 2040-06-28
Also published as: CN111881447A

Abstract

The invention belongs to the technical field of digital forensics, and in particular relates to an intelligent forensics method and system for malicious code fragments. A code fragment training set and a code fragment test set are constructed by extracting the underlying data features of a storage medium. A configured fully-connected neural network model is trained with the data in the code fragment training set, where the input is a feature vector and the output is a normal or malicious prediction. The trained fully-connected neural network model is then run on the code fragment test set to judge whether each model input is a malicious code fragment. After feature extraction, a target code fragment is input into the trained and tested fully-connected neural network model to obtain an intelligent malicious code recognition result. The method can identify malicious code fragments in storage media such as computers, mobile phones and tablets and in evidence containers such as RAW, E01 and AFF, and has good application prospects in digital forensics fields such as the automated analysis of underlying evidence data in criminal cases.

Description

Intelligent evidence obtaining method and system for malicious code fragments

Technical Field

The invention belongs to the technical field of digital forensics, and in particular relates to an intelligent forensics method and system for malicious code fragments. It is suitable for detecting and collecting evidence of malicious code fragments in the storage media of devices such as computer hard disks (including notebook computers), USB flash drives, portable hard disks, iPads and smartphones, and in forensic containers such as RAW, E01 and AFF.

Background

With the rapid development of mobile internet technology, digital crime occurs frequently. Because disk capacities keep growing and ever more digital devices store information relevant to a crime, the volume of digital evidence that judicial bodies such as public security agencies must process during an investigation has increased greatly. According to a 2019 analysis report on the digital forensics capability of Texas judicial institutions, the U.S. Federal Bureau of Investigation (FBI) has the best forensic laboratories, yet its backlog of digital evidence stretches to more than nine months, and the number of cases finally handled has had to be reduced because large amounts of digital evidence could not be analyzed effectively. In addition, because criminal evidence comes from different types of devices such as computers, smartphones, tablets, and even Internet of Things and wearable devices, this massive evidence carries metadata from different operating systems and file systems, which makes evidence analysis highly heterogeneous. Furthermore, to ensure the integrity and repeatability of digital evidence analysis, evidence from different devices must be stored in evidence containers such as AFF, E01 and RAW by means of storage medium imaging, and the evidence data in these containers is stored in an underlying binary format, which makes evidence analysis increasingly complex. Therefore, to cope with the large data volume, heterogeneity and complexity of digital evidence analysis in criminal cases, automated forensic analysis has become a key research problem in the digital forensics field. 
For the problem of identifying malicious code fragments in the complex, heterogeneous and massive underlying evidence data of digital crime investigations, exploring automated forensic detection of malicious code fragments from the underlying characteristics of evidence data storage, using deep learning theory and deep learning models, has become a hot research direction in digital forensic detection.

Disclosure of Invention

Therefore, the invention provides an intelligent forensics method and system for malicious code fragments that extracts the underlying data features of a storage medium and can identify malicious code fragments in storage media such as computers (including notebook computers) and Android smartphones and tablets and in forensic containers such as RAW, E01 and AFF, improving the recognition of malicious code fragments and offering good application prospects.

According to the design scheme provided by the invention, the intelligent malicious code fragment evidence obtaining method comprises the following contents:

extracting the bottom data characteristics of a storage medium, and constructing a code segment training set and a code segment test set for training and testing, wherein the code segment training set and the code segment test set both contain normal code segments and malicious code segment data;

training the set fully-connected neural network model by using data in the code segment training set to adjust network model parameters, wherein the input is a feature vector, and the output is a normal or malicious prediction result; aiming at the code fragment test set, carrying out test output by using the trained fully-connected neural network model so as to judge whether the model input is a malicious code fragment;

and after feature extraction is carried out on the target code segment, inputting the target code segment into a fully-connected neural network model generated through training test, and obtaining an intelligent malicious code recognition result.

As the intelligent malicious code fragment evidence obtaining method, further, evidence source data from a plurality of storage media are collected, the evidence source data are analyzed, malicious code fragment characteristics are extracted, and the malicious code fragment characteristics are normalized, wherein the storage media come from different devices and/or adopt different file system types.

As the intelligent evidence obtaining method for the malicious code segments, the method further comprises the steps of identifying the file type and the evidence storage container type according to the original evidence source in analyzing the evidence source data, and determining the original evidence source file system type or the evidence storage container type; analyzing the initial/end position of file data storage in the storage medium and the cluster size of the file data storage, and recording the initial/end position of the file data storage as the initial/end position of the malicious code segment, wherein the cluster size of the file data storage is the size of the malicious code segment; starting from the initial position of the malicious code segment, reading the malicious code data in a hexadecimal format by taking the size of the malicious code segment as a reading unit, and taking the hexadecimal data of the malicious code segment as the characteristics of the malicious code segment.

As the intelligent evidence obtaining method for the malicious code fragments, further, different devices comprise a disk and/or a portable device with a storage function; the different file system types comprise an android file system and/or a Linux file system and/or a Windows file system.

As the intelligent evidence obtaining method for the malicious code fragments, a training set and a test set are further constructed, labels are added to the data sample code fragments in a batch processing mode to distinguish normal code fragment data from malicious code fragment data, and the labeled data sample code fragments are disordered by a pseudorandom method to obtain code fragment data which are randomly sequenced and used for constructing the training set and the test set.

As the intelligent evidence obtaining method for the malicious code fragments, further, a fully-connected neural network model structure in the open-source deep learning framework TensorFlow is used, in which each neuron is connected to every neuron of the adjacent preceding and following layers.

As the intelligent malicious code segment evidence obtaining method, further, a back propagation training algorithm is used to train the fully-connected neural network model; a number of training rounds per cycle is set, and each time that number of rounds is reached the model parameters are saved and the current loss value is determined, so as to adjust the network model parameters.

As the intelligent evidence obtaining method for the malicious code fragments, further, random parameter initialization is adopted in the adjustment of network model parameters, so that the parameters obey a normal distribution or a uniform distribution and different neurons in a network layer of the model produce different outputs for different inputs; in the adjustment of the network model parameters, a cross entropy loss function is used to search for the optimal solution of the model, the loss is computed from the model's predicted value and the actual value, and the model parameters are adjusted by calculating the gradient of the loss function.

As the intelligent evidence obtaining method for the malicious code segments, further, a model complexity term is introduced into the loss function, and a regularization weight is applied to each weight parameter so as to suppress noise in the training data; and an exponentially decaying learning rate is selected, the learning rate is dynamically adjusted during training and updated once every set number of rounds, wherein the updating formula is: new learning rate = initial learning rate × learning rate decay rate ^ (current round / rounds per update).

Further, the present invention also provides an intelligent malicious code fragment forensics system, comprising: a data preprocessing module, a training test module and a target identification module, wherein,

the data preprocessing module is used for extracting the bottom layer data characteristics of a storage medium or an evidence container, constructing a code segment training set and a code segment test set for training and testing and performing characteristic preprocessing on a target code segment to be analyzed, wherein the code segment training set and the code segment test set both contain normal code segments and malicious code segment data;

the training test module is used for training the set fully-connected neural network model by using the data in the code segment training set so as to adjust the parameters of the network model, wherein the input is a feature vector, and the output is a normal or malicious prediction result; aiming at the code fragment test set, carrying out test output by using the trained fully-connected neural network model so as to judge whether the model input is a malicious code fragment;

and the target identification module is used for inputting the target code segment after feature extraction into the fully-connected neural network model generated through training test to obtain the intelligent identification result of the malicious code.

The invention has the beneficial effects that:

according to the method, the underlying data features of the storage medium are extracted, so that malicious code fragments in storage media such as computers (including notebook computers) and Android smartphones and tablets, and in forensic containers such as RAW, E01 and AFF, can be identified; the extracted code fragment data are converted into feature vector data compatible with existing deep learning models, and training and parameter tuning are then carried out with a TensorFlow deep learning network structure to obtain a deep learning model suited to code fragment processing. The method has good application prospects in digital forensics fields such as the automated analysis of evidence in criminal cases.

Description of the drawings:

FIG. 1 is a schematic flow chart of an intelligent malicious code fragment evidence obtaining method in an embodiment;

FIG. 2 is a schematic diagram of an automatic evidence-obtaining detection framework for malicious code fragments in an embodiment;

FIG. 3 is a schematic diagram of malicious code fragment feature preprocessing in an embodiment;

FIG. 4 is a schematic diagram of the back propagation algorithm in the embodiment;

FIG. 5 is a schematic diagram of data set processing in an embodiment;

FIG. 6 is a schematic representation of a TFRecords file in an embodiment;

FIG. 7 is a schematic diagram of the data flow graph automatically generated by TensorBoard in an embodiment;

FIG. 8 is a graph showing the variation of the loss with the training process in the example;

FIG. 9 is a schematic diagram showing the accuracy rate variation trend along with the training process in the embodiment;

FIG. 10 is a schematic diagram of the evidence-obtaining test result in the example.

The specific implementation mode is as follows:

in order to make the objects, technical solutions and advantages of the present invention clearer and more obvious, the present invention is further described in detail below with reference to the accompanying drawings and technical solutions.

Research on automated forensics has already produced preliminary results. Scholars have explored the necessity and importance of highly automated digital forensics and analyzed its advantages. In addition, to improve the automation of forensic analysis, push-button automation has been added to classic full-featured forensic suites such as EnCase, Forensic Toolkit and an automatic forensic browser, so that a forensic investigator can perform preliminary, and even some complex, investigation and analysis tasks simply by knowing which button to press. These popular tools try to make investigators' work easier and to promote automated forensic capability. Single-function forensic software such as TraceHunter can also provide automated correlation, interpretation and Windows registry analysis. Evidence classification in digital forensics is likewise a fast-growing and highly automated direction, and many works support automatic classification of relevant evidence on computers and mobile phones, because practitioners recognize the benefits of rapid, automated, on-site intelligence acquisition. According to analysis of the 2019 Internet crime investigation report published by the U.S. FBI, automated forensics supports rapid, automatic analysis of criminal cases and has become a key technology for reducing in-depth analysis of digital evidence in digital investigations. 
However, scholars have compared manual investigation with automated evidence classification, and the results show that in more complex cyber-attack investigations, for example where a criminal stores malicious code as fragments on a network drive or in a peer-to-peer storage system, automated classification misses potential evidence because it lacks an overall view of the malicious code. In addition, malware threats keep increasing and have become a difficult problem for digital forensic detection. According to a McAfee Labs report, more than 65 million new malware samples were added in the first quarter of 2019. Traditional malware detection relies on extracting signature features from malware samples and storing them in a database. However, extracting features from a malware sample requires a great deal of manual analysis, and signature-based detection struggles to keep pace with the rapid growth in malware; the fundamental reason is that signature scanning is effective only against known malware samples and not against new, unknown malware. Another classic approach detects malware from its run-time behavior, which involves running malware samples and observing what they do. Although this approach improves detection of unknown malware, it is vulnerable to malware that uses virtual machine escape techniques. Moreover, executing suspicious code consumes considerable time and computing resources. 
Because of the limitations of these two detection technologies, and because criminals increasingly use anti-forensic measures such as fragmentation and encryption on large-capacity storage media, malicious code has become harder to detect. Researchers have therefore turned to heuristic methods, using machine learning models to learn the characteristics of malware so as to improve detection accuracy and speed.

An embodiment of the present invention, as shown in FIG. 1, provides an intelligent malicious code fragment forensics method, including the following contents:

S101, extracting the underlying data features of a storage medium, and constructing a code segment training set and a code segment test set for training and testing, wherein the code segment training set and the code segment test set both comprise normal code segment data and malicious code segment data;

S102, training the configured fully-connected neural network model by using data in the code segment training set to adjust network model parameters, wherein the input is a feature vector, and the output is a normal or malicious prediction result; for the code segment test set, carrying out test output by using the trained fully-connected neural network model so as to judge whether the model input is a malicious code segment;

S103, after feature extraction is carried out on the target code segment, inputting the extracted features into the trained and tested fully-connected neural network model, and obtaining an intelligent malicious code recognition result.

Deep learning is a branch of machine learning: a family of algorithms that attempt high-level abstraction of data using multiple processing layers that contain complex structures or consist of multiple nonlinear transformations. Deep learning has achieved breakthroughs in several major fields such as images, speech and machine translation, and has produced a great many research results. However, obtaining a good deep learning model requires studying a framework for each specific problem (such as image classification) and tuning it over a long period (that is, optimizing the model parameters through training), which limits the application of deep neural networks. Kaiser et al. therefore explored a unified deep learning model, in which a single model adaptively solves multiple tasks of different types across different fields and data modalities, such as speech recognition, image classification and machine translation, with performance on each task that is not noticeably worse than, or is close to, existing mainstream methods. That model has so far been applied mainly to image classification, speech recognition, machine translation and similar problems, and malicious code fragment recognition has not yet been addressed. TensorFlow is an open-source software library for numerical computation based on data flow graphs. It is a deep learning framework developed by Google and one of the current mainstream deep learning frameworks; it can implement classic algorithms such as convolutional neural networks (CNN), recurrent neural networks (RNN) and deep neural networks (DNN), and is applied in speech recognition, natural language processing, computer vision and other areas. 
With the TensorFlow platform, a neural network need not be designed from scratch; the desired network can be generated by calling its interfaces directly. TensorFlow is widely used in industry and academia, and many deep learning articles that provide source code implement their models with it. Therefore, for the problem of identifying malicious code fragments in the complex, heterogeneous and massive underlying evidence data of digital crime investigations, exploring automated forensic detection of malicious code fragments from the underlying characteristics of evidence data storage has become a hot direction in digital forensic detection. Further, in the embodiment of the present invention, a fully-connected neural network model structure in the open-source deep learning framework TensorFlow is used, in which each neuron is connected to every neuron of the adjacent preceding and following layers. Referring to FIG. 2, the first module is a TensorFlow-based automatic identification framework for malicious code fragments. The deep learning network this module uses to solve the problem of identifying sensitive disk sectors is a fully connected network (FCN), that is, each neuron is connected to every neuron of the adjacent preceding and following layers; the input is a 4096-dimensional feature vector, and the output is a normal or malicious prediction. The second module trains the deep learning model with a malicious code fragment training set, fine-tunes the relevant parameters of the model, and learns the abstract features of malicious code fragments. 
The third module detects and classifies the code fragments under test with the trained hierarchical deep learning model. In the embodiment, the open-source deep learning framework TensorFlow is adopted and its fully-connected neural network model is used, with 4096 initial input nodes and 2 output nodes for the classification output.
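The 4096-input, 2-output fully connected structure described above can be illustrated with a small forward pass. This is a minimal sketch in plain Python, not the embodiment's TensorFlow code: the single hidden layer, its width (`HIDDEN_DIM`), the ReLU activation and the initialization spread are assumptions, since the text fixes only the input and output sizes.

```python
import math
import random

INPUT_DIM = 4096   # one feature per byte of a 4 KB fragment (from the text)
HIDDEN_DIM = 16    # assumed: the text does not state hidden-layer sizes
OUTPUT_DIM = 2     # normal vs. malicious (from the text)


def init_layer(n_in, n_out, rng, stddev=0.1):
    """Draw weights from a normal distribution; biases start at zero."""
    weights = [[rng.gauss(0.0, stddev) for _ in range(n_out)] for _ in range(n_in)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Fully connected layer: every input feeds every output neuron."""
    weights, biases = layer
    out = list(biases)
    for xi, row in zip(x, weights):
        for j, w in enumerate(row):
            out[j] += xi * w
    return out


def relu(values):
    """Assumed hidden activation."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, hidden, output):
    """Return [p_normal, p_malicious] for one 4096-dimensional feature vector."""
    return softmax(dense(relu(dense(x, hidden)), output))
```

With `hidden = init_layer(INPUT_DIM, HIDDEN_DIM, rng)` and `output = init_layer(HIDDEN_DIM, OUTPUT_DIM, rng)`, calling `predict` on a feature vector yields two class probabilities; an untrained network gives arbitrary values until the weights are fitted.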

Evidence sources in digital incident investigations come from different devices and from storage media with different file system types. The code fragment training set and data set are important for building and evaluating the deep learning network model. After media such as disks are processed by the malicious code feature preprocessing algorithm, a large number of data sample files are obtained; each is a 4 KB binary data file containing code fragments of one type. No code fragment data set based on storage media such as disks currently exists. As an intelligent malicious code fragment evidence obtaining method in the embodiment of the present invention, further, evidence source data derived from a plurality of storage media are collected, the evidence source data are analyzed, malicious code fragment features are extracted, and the malicious code fragment features are normalized, wherein the plurality of storage media come from different devices and/or adopt different file system types. Further, the different devices comprise magnetic disks and/or portable devices with storage functions; the different file system types comprise an Android file system and/or a Linux file system and/or a Windows file system.

As an intelligent evidence obtaining method for malicious code segments in the embodiment of the invention, further, during analysis of evidence source data, identification of a file type and an evidence storage container type is performed according to an original evidence source, and the original evidence source file system type or the evidence storage container type is determined; analyzing the initial/end position of file data storage and the cluster size of the file data storage in the storage medium according to the file system characteristics of the storage medium or the evidence obtaining storage container principle and the like, and recording the initial/end position of the file data storage as the initial/end position of a malicious code segment, wherein the cluster size of the file data storage is the size of the malicious code segment; starting from the initial position of the malicious code segment, reading the malicious code data in a hexadecimal format by taking the size of the malicious code segment as a reading unit, and taking the hexadecimal data of the malicious code segment as the characteristics of the malicious code segment.

Referring to FIG. 3, the file system type and the evidence storage container type are identified from the original evidence, and the file system type or evidence storage container type of the original evidence is determined. According to the file system characteristics of the storage medium, or the principles of storage containers such as AFF and E01, the start/end positions of file data storage in the storage medium and the cluster size of the file data storage are parsed (a cluster consists of several sectors, and each sector is 512 bytes). The start/end position of file data storage is recorded as the start/end position of a malicious code fragment in this application, and the cluster size of the file data storage is taken as the size of the malicious code fragment. Starting from the start position of the malicious code fragment, the malicious code data is read in hexadecimal format using the fragment size as the reading unit; the hexadecimal data of the fragment is called the preprocessing feature of the malicious code and serves as the direct input feature of the deep learning model framework. Because the size of a malicious code fragment corresponds to the 'file cluster' storage unit of the storage medium and is an integer multiple of the sector size, fragments never contain too little or too much data during feature preprocessing, so padding or truncating the feature data need not be considered. The size of the malicious code fragment does, however, have some influence on deep learning network model training and on actual detection. 
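The cluster-by-cluster reading step can be sketched as follows. The sketch assumes a RAW image already exposed as a seekable byte stream, start/end offsets already obtained from file system parsing, and a 4096-byte cluster (8 sectors of 512 bytes, matching the 4 KB sample files); the function names are hypothetical, and decoding E01 or AFF containers would need a dedicated library that is not shown. Scaling each byte to [0, 1] is one possible normalization, not necessarily the one used in the embodiment.

```python
SECTOR_SIZE = 512                  # from the text
CLUSTER_SIZE = 8 * SECTOR_SIZE     # assumed 4096-byte cluster (4 KB fragments)


def read_fragments(stream, start, end, cluster_size=CLUSTER_SIZE):
    """Yield whole cluster-sized fragments from the byte range [start, end)."""
    stream.seek(start)
    offset = start
    while offset + cluster_size <= end:
        chunk = stream.read(cluster_size)
        if len(chunk) < cluster_size:
            break  # image ended early; never emit a partial cluster
        yield chunk
        offset += cluster_size


def to_hex(fragment):
    """Hexadecimal text form of a fragment, two characters per byte."""
    return fragment.hex()


def to_feature_vector(fragment):
    """One feature per byte, scaled to [0, 1] (assumed normalization)."""
    return [b / 255.0 for b in fragment]
```

Because each fragment is exactly one cluster, a 4096-byte cluster maps directly onto a 4096-dimensional input vector with no padding or truncation.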
As the intelligent evidence obtaining method for the malicious code segments in the embodiment of the invention, further, in constructing the training set and the test set, labels are added to the data sample code segments in batches to distinguish normal code segment data from malicious code segment data, and the labeled data sample code segments are shuffled by a pseudo-random method to obtain randomly ordered code segment data used for constructing the training set and the test set.
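The labeling and pseudo-random shuffling step might look like the sketch below. The label encoding (0 for normal, 1 for malicious), the fixed seed and the fractional split are illustrative assumptions; the experiment described later selects a fixed number of samples per class instead.

```python
import random

NORMAL, MALICIOUS = 0, 1  # assumed label encoding


def label_and_shuffle(normal_fragments, malicious_fragments, seed=42):
    """Attach class labels in bulk, then shuffle with a seeded pseudo-random generator."""
    samples = [(frag, NORMAL) for frag in normal_fragments]
    samples += [(frag, MALICIOUS) for frag in malicious_fragments]
    random.Random(seed).shuffle(samples)  # same seed gives the same order
    return samples


def split(samples, test_fraction=0.2):
    """Cut the shuffled samples into (training set, test set)."""
    n_test = int(len(samples) * test_fraction)
    return samples[n_test:], samples[:n_test]
```

Seeding the generator keeps the ordering reproducible, which makes it possible to rebuild the same training and test sets when an experiment is repeated.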

As the intelligent malicious code segment evidence obtaining method in the embodiment of the invention, further, a back propagation training algorithm is used to train the fully-connected neural network model; a number of training rounds per cycle is set, and each time that number of rounds is reached the model parameters are saved and the current loss value is determined, so as to adjust the network model parameters. Further, in the adjustment of network model parameters, random parameter initialization is adopted, so that the parameters obey a normal distribution or a uniform distribution and different neurons in a network layer of the model produce different outputs for different inputs; in the adjustment of the network model parameters, a cross entropy loss function is used to search for the optimal solution of the model, the loss is computed from the model's predicted value and the actual value, and the model parameters are adjusted by calculating the gradient of the loss function. Further, a model complexity term is introduced into the loss function, and a regularization weight is applied to each weight parameter so as to suppress noise in the training data; and an exponentially decaying learning rate is selected, the learning rate is dynamically adjusted during training and updated once every set number of rounds, wherein the updating formula is: new learning rate = initial learning rate × learning rate decay rate ^ (current round / rounds per update).
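The loss described above, cross entropy plus a model complexity term on the weights, can be sketched in a few lines. The choice of an L2 (sum of squares) penalty and the coefficient `lam` are assumptions; the text says only that a complexity term weights each weight parameter.

```python
import math


def cross_entropy(probs, label, eps=1e-12):
    """Loss for one sample: negative log of the probability given to the true class."""
    return -math.log(max(probs[label], eps))


def l2_penalty(weight_matrices, lam):
    """Assumed complexity term: lam times the sum of squared weights (no biases)."""
    return lam * sum(w * w for matrix in weight_matrices for row in matrix for w in row)


def total_loss(batch_probs, batch_labels, weight_matrices, lam=1e-4):
    """Mean cross entropy over the batch plus the complexity term."""
    data_loss = sum(cross_entropy(p, y) for p, y in zip(batch_probs, batch_labels))
    data_loss /= len(batch_labels)
    return data_loss + l2_penalty(weight_matrices, lam)
```

A confident wrong prediction is penalized far more than a confident correct one, and large weights raise the loss even when predictions are right, which is what discourages fitting noise in the training data.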

The back propagation training algorithm is the main module of the malicious code fragment recognition framework. As shown in FIG. 4, the algorithm flow may specifically be: after the training process starts, if a saved model exists it is restored; otherwise the training loop starts directly. The model parameters are saved once every 1000 training iterations, and the current loss value is computed and printed. Model saving is added to the algorithm to support resuming training from a breakpoint and to allow a given number of training rounds to be run after the loss value has stabilized. Two points must be considered when implementing back propagation. First, random parameter initialization is adopted so that the parameters obey a normal or uniform distribution, which ensures that different neurons in a network layer produce different outputs for different inputs and that network training converges well. Second, the training optimization method: in the deep learning model, a cross entropy loss function is used to find the optimal solution of the model; TensorFlow computes the loss from the model's predicted value and the actual value, calculates the gradient of the loss function, and adjusts the model parameters according to the gradient. In addition, a regularization mechanism is introduced to improve the generalization ability of the automatic malicious code fragment recognition framework: a model complexity term is introduced into the loss function and a weight is applied to each weight parameter to suppress noise in the training data; regularization is generally not applied to the bias parameters of the model. 
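The restore-or-start flow with periodic saving can be sketched independently of any deep learning framework. In this sketch the checkpoint is a JSON file, `train` and `toy_step` are hypothetical names, and `toy_step` is a stand-in for one back propagation update (it minimizes a one-parameter quadratic); the embodiment itself would rely on TensorFlow's own model saving.

```python
import json
import os


def train(step_fn, params, ckpt_path, total_steps, save_every=1000):
    """Run step_fn up to total_steps, resuming from ckpt_path when it exists."""
    step = 0
    if os.path.exists(ckpt_path):
        # A saved model exists: restore parameters and the step counter.
        with open(ckpt_path) as f:
            state = json.load(f)
        params, step = state["params"], state["step"]
    while step < total_steps:
        params, loss = step_fn(params)
        step += 1
        if step % save_every == 0:
            # Save the parameters and report the current loss.
            with open(ckpt_path, "w") as f:
                json.dump({"params": params, "step": step}, f)
            print(f"step {step}: loss {loss:.6f}")
    return params, step


def toy_step(params, lr=0.1):
    """Hypothetical stand-in for one update: gradient descent on (w - 3)^2."""
    w = params["w"]
    loss = (w - 3.0) ** 2
    grad = 2.0 * (w - 3.0)
    return {"w": w - lr * grad}, loss
```

Calling `train` a second time with a larger `total_steps` continues from the last saved step instead of starting over, which is the breakpoint-resume behavior the text describes.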
In addition, the learning rate setting has a great influence on training. An exponentially decaying learning rate can be selected so that the learning rate is adjusted dynamically during training and updated once every set number of rounds: new learning rate = initial learning rate × learning rate decay rate ^ (current round / rounds per update). A moving average records the average value of each parameter over a period of time; this average follows the parameter slowly, like a shadow, which can also improve the generalization of the model. The moving average is applied to all parameters.
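A minimal sketch of the two mechanisms just described follows. The exponent form (current step divided by the update interval) is the conventional exponential decay schedule and is assumed here; the parameter names, the staircase option and the moving-average decay value are likewise illustrative rather than taken from the embodiment.

```python
def decayed_learning_rate(base_lr, decay_rate, global_step, decay_steps, staircase=True):
    """base_lr * decay_rate ** (global_step / decay_steps).

    With staircase=True the exponent is an integer, so the rate only changes
    once every decay_steps steps.
    """
    if staircase:
        exponent = global_step // decay_steps
    else:
        exponent = global_step / decay_steps
    return base_lr * decay_rate ** exponent


class MovingAverage:
    """Keeps a slowly moving 'shadow' copy of every parameter."""

    def __init__(self, decay=0.99):
        self.decay = decay
        self.shadow = {}

    def update(self, params):
        """Blend the current parameter values into their shadow values."""
        for name, value in params.items():
            if name not in self.shadow:
                self.shadow[name] = value  # first sighting: shadow starts at the value
            else:
                self.shadow[name] = (
                    self.decay * self.shadow[name] + (1 - self.decay) * value
                )
        return self.shadow
```

A decay close to 1 makes the shadow values change slowly, so brief swings in a parameter during training have little effect on the averaged model used for evaluation.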

Further, an embodiment of the present invention further provides an intelligent malicious code fragment forensics system, including: a data preprocessing module, a training test module and a target recognition module, wherein,

the training test module is used for training the set full-connection neural network model by using the data in the code segment training set so as to adjust the parameters of the network model, wherein the input is a feature vector, and the output is a normal or malicious prediction result; aiming at the code fragment test set, the trained fully-connected neural network model is used for carrying out test output so as to judge whether the model input is a malicious code fragment;

and the target identification module is used for performing feature extraction on the target code fragment and inputting the result into the trained and tested fully-connected neural network model to obtain the intelligent identification result of the malicious code.

In order to verify the effectiveness of the technical scheme in the embodiment of the invention, the following further explanation is made through specific experimental data:

At present, no code fragment data set based on storage media such as disks exists. Therefore, in the preprocessing stage, about 150 MB each of normal code and malicious code is selected for training, with both kinds drawn from the Android, Linux, and Windows platforms and kept roughly balanced in quantity. The data are stored in two folders (normal and malware). After the data set is made, it is adjusted so that it finally contains 39944 normal code fragment files and 40056 malicious code fragment files, for a total of 80000 sample files. The test set is used to evaluate the trained model; in this experiment the algorithm is tested on the test set and no separate validation set is designed. To make the test set, 2500 samples of each of the two classes are randomly selected from the training data, giving 5000 files, and another 5000 new code fragment files are produced by the same method, giving 10000 test files in total.

On the basis of the above processing, the data sample code fragment files are labeled. After the classified data sample code fragment files are obtained, labels are added to them in batch. The specific algorithm process is as follows:

(1) Traverse all file names in the normal directory, one file name per line, append the label '0' after each file name, and save the result to label0.txt; this file holds the normal code fragment labels.

(2) Traverse all file names in the malware directory, one file name per line, append the label '1' after each file name, and save the result to label1.txt; this file holds the malicious code fragment labels.

(3) Merge the contents of the normal and malicious code fragment label files and shuffle the merged lines with a pseudo-random method, so that the label '0' and label '1' entries are randomly distributed. This yields 80000 lines of randomly ordered text in total, which are stored as train_texts. The test set label file test_labels is produced in the same way.
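Steps (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the space separator between file name and label, the fixed seed, and the function names are hypothetical, and the intermediate label0.txt and label1.txt files are skipped by merging the two lists in memory.

```python
import os
import random

def list_labeled(directory, label):
    """Return one 'filename label' line per file in the directory."""
    return [f"{name} {label}" for name in sorted(os.listdir(directory))]

def build_label_file(normal_dir, malware_dir, out_path, seed=0):
    """Label normal files '0' and malware files '1', shuffle, and save."""
    lines = list_labeled(normal_dir, 0) + list_labeled(malware_dir, 1)
    # Pseudo-random shuffle so that the two labels are interleaved.
    random.Random(seed).shuffle(lines)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return lines
```

Calling `build_label_file("normal", "malware", "train_labels.txt")` would produce one shuffled label file for a training set laid out as in the text; the output file name here is only an example.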

In addition, to improve the efficiency of automatic malicious code fragment forensics and to reduce the time spent reading files during training, the training data set and its labels (as well as the test set and its labels) are converted into TFRecord files; the specific algorithm process is shown in fig. 5. Applying the data set processing algorithm of fig. 5 yields the two TFRecord files shown in fig. 6.

For the automatic forensic detection data set, the malicious code fragment size may be set to 4 KB, following the FAT32 file system format under Windows; 4 KB is also the default basic unit of file storage on current large-capacity disks (4 KB corresponds to 8 sectors). Combining the malicious code fragment preprocessing algorithm with the data set making algorithm, the procedure is specifically as follows:

(1) Prepare normal code and malicious code in comparable amounts (total size).

(2) Write the two kinds of code to a clean disk, which is first filled entirely with zeros using the WinHex tool.

(3) Locate each individual program according to the directory table of the disk file system.

(4) Read the program data (normal or malicious) in units of 4 KB and store each unit as a 4 KB file named: unique ID of the program + serial number.

(5) Read the code data one program at a time and store normal code and malicious code separately, which simplifies making the data set.

(6) Generate files in 4 directories: normal code fragment training (train_positive), malicious code fragment training (train_virus), normal code fragment testing (test_positive), and malicious code fragment testing (test_virus).
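The 4 KB reading step can be sketched as follows. The sketch reads an ordinary file rather than raw disk clusters. Zero-padding the final chunk mirrors the zero-filled disk, and scaling bytes to [0, 1] is only one possible normalization; both are assumptions, as are the function names and the file name pattern.

```python
import os

FRAGMENT_SIZE = 4096  # 4 KB, i.e. 8 sectors of 512 bytes

def split_into_fragments(program_path, out_dir, program_id):
    """Write each 4 KB chunk to '<program_id>_<serial>' and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with open(program_path, "rb") as src:
        serial = 0
        while True:
            chunk = src.read(FRAGMENT_SIZE)
            if not chunk:
                break
            # Assumption: pad the last chunk with zeros, as cluster slack
            # on a zero-filled disk would be.
            chunk = chunk.ljust(FRAGMENT_SIZE, b"\x00")
            path = os.path.join(out_dir, f"{program_id}_{serial:06d}")
            with open(path, "wb") as dst:
                dst.write(chunk)
            paths.append(path)
            serial += 1
    return paths

def fragment_to_features(fragment):
    """Map every byte to a float in [0, 1]; a 4 KB fragment gives 4096 values."""
    return [b / 255.0 for b in fragment]
```

Each fragment then yields a 4096-element vector, which matches the 4096 network inputs mentioned in the experiments.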

The quality of the data set directly affects the performance of the fully-connected deep learning network, because the network is trained and learns from the training data. If, after long training, the loss decreases slowly or fails to converge, or the accuracy stays low, and the network itself contains no errors, the problem may lie in the data set. Possible causes include, but are not limited to: an unbalanced ratio of malicious to normal code fragments, with one class taking too large a share; data set labels that are not shuffled, for example the first half all '0' (normal code fragments) and the second half all '1' (malicious code fragments); and invalid collected data or errors in label making. To improve the training effect of the fully-connected deep learning network, a sample imbalance handling mechanism can be applied to the problems of unbalanced proportions of malicious and normal code and of unshuffled code fragment labels. Specifically: in the embodiment of the invention, when the fully-connected network was first implemented, the collected malicious programs were limited and the proportions were not balanced, so the ratio of normal to malicious program cluster files was about 8:1. The direct consequence was that the training data randomly selected during training rarely contained malicious sensitive information, the model built by the neural network could not adequately represent such information, and the expected function could not be achieved.
Similarly, for a two-class problem, if the labels are not shuffled, training may be confined to samples of a single class. In fact, machine learning usually assumes that each class has the same number of training samples, that is, that the classes are balanced. An "unbalanced" training set may cause the trained model to favor the class with more samples and to neglect the class with fewer samples, which harms the generalization capability of the model. As an extreme example, if the ratio of normal samples to malicious samples in the training set of the classification problem is 99. It is assumed that if the model is used to identify and classify data with a normal program sample and malicious program ratio of 1.

TABLE 1 impact of unbalanced proportion samples on training

Number of positive and negative samples	Loss trend	Accuracy after 50000 rounds	Recognition result
80000:10000	Decreases only slowly	About 50%	Random
40000:40000	Decreases and gradually converges	>95%	Basically correct

An accuracy of about 50%, as in Table 1, is meaningless for a two-class problem, since uniformly random guessing already yields 50% in theory. This situation arose unexpectedly during engineering implementation. Once the above analysis had revealed the effect of the unbalanced samples, a relatively balanced data set was recreated: the numbers of normal and malicious program samples are comparable, about 40000 each, and the label file is shuffled into random order. The problem did not recur with the new data sample set.
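A simple check of the class ratio, followed by rebalancing, might look like the sketch below. Random downsampling of the larger class is only one possible rebalancing method and is an assumption here; the text states only that a balanced, shuffled data set was recreated. The function names are hypothetical.

```python
import random
from collections import Counter

def class_ratio(labels):
    """Return the majority class count divided by the minority class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def balance_by_downsampling(samples, labels, seed=0):
    """Randomly drop samples until every class is equally large, then shuffle."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = min(len(items) for items in by_class.values())
    pairs = []
    for label, items in by_class.items():
        pairs.extend((sample, label) for sample in rng.sample(items, target))
    # Shuffle so that the labels are not grouped by class.
    rng.shuffle(pairs)
    return [s for s, _ in pairs], [y for _, y in pairs]
```

For label counts of 80000 and 10000, `class_ratio` returns 8.0, which flags the imbalance before any training time is spent.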

Through testing and adjustment, a three-layer fully-connected neural network (FCN) was found to achieve the best forensic detection result. The back propagation hyper-parameters of the fully-connected neural network, set empirically, are: initial learning rate 0.1, learning rate decay rate 0.99, regularization coefficient 0.0001, and moving average decay rate 0.99. Training uses the training data set, feeding 200 samples per round and saving the training result every 1000 rounds, for 80000 rounds in total. The resulting data flow diagram of the back propagation training algorithm is shown in fig. 7. The loss and accuracy during training (recorded once per 1000 rounds) are shown in figs. 8 and 9. As the figures show, the loss and accuracy gradually stabilize as training continues, and the accuracy exceeds 99% once stable.
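The training schedule above (restore an existing model, save every 1000 rounds, print the loss) can be sketched framework-independently. The `train_step` callback, the pickle-based checkpoint, and the dictionary keys are hypothetical stand-ins for the TensorFlow session and saver used in the embodiment; only the numeric values come from the text.

```python
import os
import pickle

HPARAMS = {
    "learning_rate_base": 0.1,
    "learning_rate_decay": 0.99,
    "regularization_rate": 0.0001,
    "moving_average_decay": 0.99,
    "batch_size": 200,
    "save_every": 1000,
    "training_steps": 80000,
}

def train(train_step, state, checkpoint_path, hparams=HPARAMS):
    """Run the training loop, resuming from checkpoint_path if it exists.

    train_step(state, step, hparams) must return (new_state, loss).
    """
    start = 0
    if os.path.exists(checkpoint_path):
        # Breakpoint resume: restore the step counter and the model state.
        with open(checkpoint_path, "rb") as f:
            start, state = pickle.load(f)
    for step in range(start + 1, hparams["training_steps"] + 1):
        state, loss = train_step(state, step, hparams)
        if step % hparams["save_every"] == 0:
            with open(checkpoint_path, "wb") as f:
                pickle.dump((step, state), f)
            print(f"step {step}: loss = {loss:.6f}")
    return state
```

If the run is interrupted, calling `train` again with the same checkpoint path continues from the last saved round instead of starting over.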

In addition, an experimental five-layer fully-connected neural network (with three hidden layers) was built and trained on the data set. The experiment showed that a small loss value was computed only on the first calculation after training started; every loss value obtained afterwards was 'nan', that is, the loss was too small to be computed. This is consistent with theory: when a fully-connected neural network reaches five layers, the vanishing gradient problem appears during training. Although training proceeds normally after the activation function is changed to relu(), the larger network structure greatly increases the training time, because the number of parameters grows exponentially, and the detection results of the trained model are not as good as those of the three-layer fully-connected neural network. A possible reason is that the number of network inputs is too large: 4096 inputs in a sense inherently limit the network size.

The trained automatic malicious code fragment forensic detection algorithm can identify not only code fragments but also target files and underlying storage fragment data in forensic containers such as RAW and AFF. The result of one forensic detection test run is shown in fig. 10, where 'full' denotes a fragment filled entirely with identical content, 'malware' denotes a malicious program fragment, and 'normal' denotes a normal code fragment. It can be seen that the algorithm identifies these data accurately.

In addition, VirusTotal is used to scan the malicious code fragments in the test set. VirusTotal is a service developed by the independent IT security laboratory Hispasec Sistemas that uses a variety of antivirus engines. The test set code fragment sample files were analyzed with VirusTotal, with the results shown in Table 2:

TABLE 2 Forensic detection results for the test set on the VirusTotal website

As the data in Table 2 show, the combined detection results of the different antivirus engines for the malicious code fragments in the test set are poor. This indicates that most existing antivirus engines, although able to detect complete malicious code, are not suited to analyzing malicious code fragments. It also shows that a new tool is needed for malicious code analysis of the underlying data in storage media, including forensic containers such as RAW and AFF, in digital forensic investigation; the deep-learning-based automatic forensic detection scheme for malicious code fragments in the embodiment of the present invention can better solve this problem.

Unless specifically stated otherwise, the relative steps, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present invention.

Based on the foregoing system, an embodiment of the present invention further provides a server, including: one or more processors; a storage device to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the system as described above.

Based on the above system, the embodiment of the present invention further provides a computer readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the above system.

The device provided by the embodiment of the present invention has the same implementation principle and the same technical effects as those of the foregoing system embodiment, and for the sake of brief description, reference may be made to corresponding contents in the foregoing system embodiment where no part of the embodiment of the device is mentioned.

It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working process of the system and the apparatus described above may refer to the corresponding process in the foregoing system embodiment, and details are not described herein again.

In all examples shown and described herein, any particular value should be construed as merely exemplary, and not as a limitation, and thus other examples of example embodiments may have different values.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and system may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention or a part thereof which contributes to the prior art in essence can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the system according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed herein, anyone skilled in the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they shall be covered by its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. An intelligent malicious code fragment evidence obtaining method is characterized by comprising the following contents:

extracting the bottom data characteristics of the storage medium, and constructing a code segment training set and a code segment testing set for training and testing, wherein the code segment training set and the code segment testing set both comprise normal code segments and malicious code segment data;

training the set fully-connected neural network model by using data in the code segment training set to adjust network model parameters, wherein the input is a feature vector, and the output is a normal or malicious prediction result; aiming at the code fragment test set, the trained fully-connected neural network model is used for carrying out test output so as to judge whether the model input is a malicious code fragment;

after feature extraction is carried out on the target code segment, the target code segment is input into a full-connection neural network model generated through training and testing, and an intelligent malicious code recognition result is obtained;

collecting evidence source data from a plurality of storage media, analyzing the evidence source data, extracting malicious code fragment features, and normalizing the malicious code fragment features, wherein the storage media come from different devices and/or adopt different file system types;

in the process of analyzing the evidence source data, identifying the file type and the evidence storage container type according to an original evidence source, and determining the original evidence source file system type or the evidence storage container type; analyzing the initial/end position of file data storage and the cluster size of the file data storage in a storage medium, and recording the initial/end position of the file data storage as the initial/end position of a malicious code segment, wherein the cluster size of the file data storage is the size of the malicious code segment; starting from the initial position of the malicious code segment, reading the malicious code data in a hexadecimal format by taking the size of the malicious code segment as a reading unit, and taking the hexadecimal data of the malicious code segment as the characteristics of the malicious code segment.

2. The intelligent malicious code fragment evidence obtaining method according to claim 1, characterized in that the different devices comprise disks and/or portable devices with storage function; the different file system types comprise an android file system and/or a Linux file system and/or a Windows file system.

3. The intelligent malicious code fragment evidence obtaining method according to claim 1, wherein a training set and a test set are constructed, tags are added to data sample code fragments in a batch processing mode to distinguish normal code fragment data from malicious code fragment data, and the tagged data sample code fragments are scrambled by a pseudorandom method to obtain code fragment data which are randomly sequenced and used for constructing the training set and the test set.

4. The intelligent malicious code fragment evidence obtaining method according to claim 1, wherein a fully-connected neural network model structure in a deep learning open source framework TensorFlow is utilized, and each neuron in the fully-connected neural network model structure has a connection relation with each neuron of front and back adjacent connection layers.

5. The intelligent malicious code fragment evidence obtaining method according to claim 1 or 4, wherein a back propagation training algorithm is used for training a fully-connected neural network model, and by setting cycle turns of a period, model parameters are saved and a current loss value is determined when the cycle turns in each period are met so as to adjust the network model parameters.

6. The intelligent malicious code fragment evidence obtaining method according to claim 5, wherein random parameter initialization is adopted in adjusting network model parameters, so that the parameters obey normal distribution or uniform distribution, and different neurons in a network layer in the model are ensured to have different outputs for different inputs; in the adjustment of the network model parameters, a cross entropy loss function is used for searching for the optimal solution of the model, a loss function is obtained according to the predicted value and the actual value of the input model, and the model parameters are adjusted by calculating the gradient of the loss function.

7. The intelligent malicious code fragment evidence obtaining method according to claim 5, wherein a model complexity index is introduced into the loss function, and an index weight is set for each weight parameter so as to suppress noise in the training data; and an exponential decay learning rate is selected, the learning rate is dynamically adjusted in the training process, and the learning rate is updated every other round, wherein the updating formula adopts: new learning rate = learning rate initial value × learning rate decay rate.

8. An intelligent malicious code fragment forensics system, comprising: a data preprocessing module, a training test module and a target recognition module, wherein,

the data preprocessing module is used for extracting the bottom layer data characteristics of a storage medium or an evidence container, constructing a code segment training set and a code segment testing set for training and testing and performing characteristic preprocessing on a target code segment to be analyzed, wherein the code segment training set and the code segment testing set both comprise normal code segments and malicious code segment data;

the training test module is used for training the set full-connection neural network model by using the data in the code segment training set so as to adjust the parameters of the network model, wherein the input is a feature vector, and the output is a normal or malicious prediction result; aiming at the code fragment test set, carrying out test output by using the trained fully-connected neural network model so as to judge whether the model input is a malicious code fragment;

the target identification module is used for inputting the target code segment after feature extraction into a full-connection neural network model generated through training test to obtain an intelligent identification result of the malicious code;

in the process of analyzing the evidence source data, identifying the file type and the evidence storage container type according to the original evidence source, and determining the original evidence source file system type or the evidence storage container type; analyzing the initial/end position of file data storage in the storage medium and the cluster size of the file data storage, and recording the initial/end position of the file data storage as the initial/end position of the malicious code segment, wherein the cluster size of the file data storage is the size of the malicious code segment; starting from the initial position of the malicious code segment, reading the malicious code data in a hexadecimal format by taking the size of the malicious code segment as a reading unit, and taking the hexadecimal data of the malicious code segment as the characteristics of the malicious code segment.