CN108509775B

CN108509775B - Malicious PNG image identification method based on machine learning

Info

Publication number: CN108509775B
Application number: CN201810128524.7A
Authority: CN
Inventors: 杨悉瑜; 翁健; 魏林锋; 杨悉琪; 潘冰; 张悦; 李明
Original assignee: Jinan University
Current assignee: Jinan University
Priority date: 2018-02-08
Filing date: 2018-02-08
Publication date: 2020-11-13
Anticipated expiration: 2038-02-08
Also published as: CN108509775A

Abstract

The invention provides a malicious PNG image identification method based on machine learning, which belongs to the technical field of network space security and comprises the steps of firstly establishing a PNG image feature library and a digital steganography identification model; the method comprises the steps that a request for uploading a picture file is examined at a server side, feature matching identification is carried out according to a PNG image feature library, whether the PNG picture is legal or not is preliminarily identified, if the PNG picture is legal, a digital steganography identification model is called to mine whether the PNG picture has information hiding or not, and if the PNG picture is illegal or has information hiding, uploading is refused; the method comprises the steps of monitoring PNG picture format file data in a webpage transmission process at a client, carrying out feature matching identification according to a PNG image feature library, calling a digital steganography identification model to find whether information hiding exists in a PNG picture if the PNG picture format file data are legal, and forbidding access to picture resources if the PNG picture format file data are illegal or the information hiding exists. The invention can prohibit the uploading of illegal pictures at the server side and prohibit the access to illegal pictures at the client side, thereby strengthening the network security.

Description

Malicious PNG image identification method based on machine learning

Technical Field

The invention belongs to the technical field of network space security, and particularly relates to a malicious PNG image identification method based on machine learning.

Background

With the rapid popularization of networks and the rapid development of digital technology, security problems in cyberspace have gradually come into public view and are receiving more and more attention.

On one hand, the browser is the main medium through which people obtain internet information, and its security problems cannot be overlooked. In recent years, due to reasons such as stricter JavaScript examination, more and more web pages are implanted with web page advertisements of all kinds. In mild cases these induce users to click and access malicious links; in severe cases they bypass computer and network defense systems by attaching malicious software and malicious Dynamic Link Library (DLL) files to web page pictures, directly causing adverse effects such as virus infection and information leakage on users' personal computers and mobile devices.

On the other hand, incidents in which websites are illegally controlled and large amounts of data are leaked emerge one after another. One frequent attack technique is to upload malicious code, such as a one-line Trojan (webshell), through a file upload function in order to take control of the server, and the harm it causes is considerable. Detection and bypassing of uploaded malicious code is a contest between attacker and defender that never stops. In recent years, attackers have begun to use uploaded, seemingly legal PNG pictures to avoid detection by intrusion detection systems: malicious code is hidden in a forged legal PNG picture through digital steganography techniques such as encoding and LSB steganography. Once the upload succeeds, the attacker can remotely control the website server by accessing and parsing the elaborately constructed attack payload hidden in the PNG picture, and can then carry out more destructive operations, such as stealing the private data of website users or using the controlled server as a zombie machine to launch denial-of-service (DoS) attacks on other servers.

In summary, whether on a client such as a browser or on a server hosting a website, auditing the pictures in web pages to prevent hidden malicious behavior is a problem to be solved urgently. PNG pictures are widely used in web pages because of their small size, lossless compression, and optimization for network transmission and display; they are also a good carrier for information hiding and should therefore be a focus of research.

If the server side, when processing a user's picture upload request, can efficiently and accurately identify legal picture upload requests and analyze whether a picture uses digital steganography and contains a malicious attack payload, and if the client, when accessing web page resources, can filter the picture resources in the web page and prevent picture resources suspected of containing malicious program files from being downloaded automatically, then malicious behavior can be stopped at the source.

To this end, we introduce machine learning techniques and digital steganography techniques to solve this problem.

The application of machine learning technology is spread in various fields of artificial intelligence, and is a core technology of artificial intelligence. At present, the machine learning technology also plays a great role in the network space security field due to the characteristics of autonomous learning, efficient learning and accurate learning.

The implementation of machine learning has an inseparable relationship with three components: an environment, a learning portion, and an execution portion. The environment provides some information to the learning part of the system, the learning part uses the information to modify the knowledge base to improve the efficiency of the system execution part to complete the task, the execution part completes the task according to the knowledge base, and simultaneously feeds back the obtained information to the learning part.

The following describes in detail three factors that influence the design of the machine learning system, taking the identification of PNG images as an example:

Information provided by the environment to the system: the knowledge base stores general principles that guide the actions of the execution part, but the environment provides a wide variety of information to the system. If the quality of the information is high and differs little from the general principles, the learning part can process it easily. If the system is given disordered, specific information that guides only specific actions, it must, after obtaining enough data, delete unnecessary details, summarize and generalize, form general principles that guide actions, and put them into the knowledge base; the task of the learning part is then relatively heavy and its design relatively difficult.

A knowledge base: the knowledge is expressed in various forms such as a head mark of the PNG image, a storage manner of the PNG image, an end mark of the PNG image, and the like. These representations each have their own characteristics, and the following 4 aspects are satisfied when selecting a representation:

(1) the expression ability is strong;

(2) the reasoning is easy;

(3) the knowledge base is easy to modify;

(4) the knowledge representation is easily scalable.

An execution part: this is the core of the whole system, because the actions of the execution part are what the learning part aims to improve. In the process of identifying PNG images, the content of the learning part is continuously adjusted according to the identification results so as to improve the accuracy of execution.

Digital steganography is a security technique that embeds secret information into a digital medium without compromising the quality of its carrier. By processing the secret information through the digital steganography technology, the third party can not perceive the existence of the secret information and can not know the content of the secret information. Steganographic carriers include images, audio, video, etc. In recent years, digital steganography has become the focus of information security technology by virtue of the characteristics of changeability, strong secrecy and the like. Because each Web site depends on various multimedia resources, such as audio, video, images and the like, an attacker can hide attack behaviors in the multimedia by applying a digital steganography technology to malicious software and malicious attack loads and can easily bypass anti-malicious software detection, thereby causing greater potential threats.

Taking images as an example of a multimedia resource, classic digital image steganography falls into two categories: steganography based on the spatial domain and steganography based on the transform domain. Spatial-domain steganography mainly includes Least Significant Bit (LSB) steganography; transform-domain steganography mainly involves the Discrete Cosine Transform (DCT) coefficients of an image and includes Jsteg steganography, F5 steganography, OutGuess steganography, Model-Based (MB) steganography, and the like.
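As an illustration of the spatial-domain case, the following minimal Python sketch shows how LSB steganography overwrites only the lowest bit of each pixel value. The function names `embed_lsb` and `extract_lsb` and the sample values are hypothetical and are not taken from the patent; sequential embedding is assumed for simplicity.

```python
def embed_lsb(pixels, message_bits):
    """Return a copy of pixels with message_bits written into the lowest bit."""
    if len(message_bits) > len(pixels):
        raise ValueError("message is longer than the carrier")
    stego = list(pixels)
    for idx, bit in enumerate(message_bits):
        # Clear the least significant bit, then set it to the message bit.
        stego[idx] = (stego[idx] & ~1) | bit
    return stego


def extract_lsb(pixels, n_bits):
    """Read back the lowest bit of the first n_bits pixels."""
    return [p & 1 for p in pixels[:n_bits]]


# Hypothetical 8-pixel grayscale carrier and 8-bit message.
cover = [120, 121, 200, 37, 64, 255, 0, 18]
secret = [1, 0, 1, 1, 0, 0, 1, 0]
stego = embed_lsb(cover, secret)
print(stego)
print(extract_lsb(stego, len(secret)))
```

Each pixel changes by at most 1, which is why the carrier looks unchanged to a viewer while the payload remains fully recoverable.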

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a malicious PNG image identification method based on machine learning, which adopts a PNG image feature library to carry out feature matching identification, and judges whether the PNG image has hidden information or not by means of a digital steganography identification model, so that uploading of illegal images is prohibited at a server side, access to the illegal images is prohibited at a client side, and network security is enhanced.

The invention is realized by adopting the following technical scheme: a malicious PNG image identification method based on machine learning comprises the following steps:

step one, establishing a PNG image feature library and a digital steganography recognition model through machine learning;

step two, checking all requests for uploading picture files at the server, performing feature matching identification on the PNG picture by contrasting the PNG picture feature library established in the step one, and rejecting the uploading request if an illegal PNG picture format is found; otherwise, the PNG picture is subjected to primary identification, and the step three is carried out;

step three, for the PNG picture format file passing the primary recognition, calling the digital steganography recognition model established in the step one, mining whether the PNG picture has information hiding, and if so, rejecting the uploading request; if not, allowing the uploading request;

step four, monitoring PNG picture format file data in the webpage transmission process at the client, performing feature matching identification on the PNG picture by contrasting the PNG picture feature library established in the step one, and if an illegal PNG picture format is found, forbidding to access the picture resource; otherwise, entering the step five;

and step five, calling the digital steganography recognition model established in the step one, mining whether the PNG picture has information hiding, regarding the picture with the information hiding, considering that malicious information is possibly hidden, and forbidding to access the picture resource.

Preferably, the PNG image feature library established in the step one is as follows: firstly, providing batch PNG images as training set data to be imported into a machine learning system; secondly, a PNG image feature recognition library is established, and the PNG image feature recognition library comprises the following feature information: (1) PNG header feature; (2) PNG end flag IEND block; (3) an IHDR block recording PNG image information; (4) an IDAT block storing actual image data; (5) storing the image redundancy information block; and finally, selecting a support vector machine model for feature learning aiming at the recognition library to complete the recognition and classification of the target.

Preferably, the digital steganography recognition model in the step one is established by combining shallow learning and deep learning: on one hand, a feature library is established based on the steganographic features of a classical steganographic algorithm for feature learning; on the other hand, based on the characteristic that the quality of the image after steganography is liable to change slightly, filtering pretreatment is carried out on the PNG image containing steganography information and the PNG image without steganography information by using a high-pass filter respectively, the image display characteristic is enhanced, the obtained residual image is used as a training set, then a convolutional neural network model is selected for transfer learning, and finally the probability that the digital steganography exists in the image is output.

Preferably, establishing the feature library for feature learning based on the steganographic features of the classical steganographic algorithm comprises selecting the RS analysis algorithm for supervised learning of PNG images:

firstly, dividing an image input into a model to be trained into a plurality of image blocks with the same size, scanning and arranging each image block into a pixel vector $G = (x_1, x_2, \ldots, x_n)$, and calculating the spatial correlation of each image block using the following formula:

$$f(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n-1} \left| x_{i+1} - x_i \right|$$

wherein $x_i$ represents the gray value of each pixel; the smaller the value of $f$, the smaller the change of the gray value between adjacent pixels and the stronger the spatial correlation of the image block;

then, a non-negative inversion operation is applied to randomly extracted pixels of each image block, wherein the inversion functions are defined as follows:

denote by $F_1$ the function that exchanges the pixel values $2i$ and $2i+1$, i.e. $0 \leftrightarrow 1,\ 2 \leftrightarrow 3,\ \ldots,\ 254 \leftrightarrow 255$;

denote by $F_{-1}$ the function that exchanges the pixel values $2i-1$ and $2i$, i.e. $-1 \leftrightarrow 0,\ 1 \leftrightarrow 2,\ \ldots,\ 255 \leftrightarrow 256$;

denote by $F_0$ the identity, which leaves the pixel value unchanged;

calculating the proportion $R_M$ of image blocks whose spatial correlation increases and the proportion $S_M$ of image blocks whose spatial correlation decreases;

similarly, a non-positive inversion operation is applied to randomly extracted pixels of each image block, and the proportion $R_{-M}$ of image blocks whose spatial correlation increases and the proportion $S_{-M}$ of image blocks whose spatial correlation decreases are calculated;

if applying the non-positive inversion increases the degree of disorder more than applying the non-negative inversion does, setting a label indicating that LSB steganography characteristics exist for the PNG image; otherwise, setting the label as having no LSB steganography characteristics, and outputting;

and finally, forming training data by the input object and the expected output and establishing a learning mode, and estimating whether the LSB steganography exists in the new PNG image according to the learning mode.

Compared with the prior art, the invention has the following beneficial effects: the method introduces machine learning and digital steganography techniques, establishes a PNG image feature library for feature matching identification to preliminarily judge whether a PNG image hides malicious information, and further judges whether the PNG image has hidden information by means of a digital steganography identification model, so that uploading of illegal images is prohibited at the server side, access to illegal images is prohibited at the client side, and network security is enhanced. In the digital steganography recognition model, the RS analysis algorithm is selected for supervised learning of PNG images; whether LSB steganography characteristics exist in an image is judged by whether the degree of disorder of the image changes equally under the non-negative and non-positive flipping operations of the flipping function; a convolutional neural network then performs deep learning and judges the probability that digital steganography exists in the image. The accuracy is high, and the whole model is simple in design and easy to implement.

Drawings

Fig. 1 is a flowchart of a malicious PNG image identification method based on machine learning according to an embodiment of the present invention;

fig. 2 is a frame diagram of a digital steganography recognition model in a malicious PNG image recognition method based on machine learning according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The invention is realized based on two parts, namely a server side and a client side. When the technical scheme of the invention is applied to the server, if each request for uploading the picture file is recorded and sequentially enters the PNG characteristic recognition library and the digital steganography recognition model as test set data to be matched, the behavior of a hacker for controlling the server by uploading the attack load can be effectively inhibited. When the technical scheme of the invention is applied to the client, if each webpage resource containing the picture is recorded and sequentially enters the PNG feature recognition library and the digital steganography recognition model as test set data to be matched, the behavior of the user equipment controlled by malicious behavior can be effectively restrained from the source.

First, a PNG image feature recognition library is established through training on a large number of PNG images, and a digital steganography recognition model is established by combining shallow learning and deep learning on PNG images that have been subjected to steganography with various digital steganography techniques.

In the server environment, whether a file uploaded by a client is a PNG image is identified according to the PNG image feature recognition library. If the file is confirmed to be a PNG image, it is preliminarily determined to be legal and proceeds to the next detection; if the file does not meet the PNG image format requirements, the upload is considered illegal and the upload request is rejected. After the PNG image is preliminarily determined to be legal, the digital steganography recognition model is used to detect whether the PNG image has information hiding. If it does, the file is considered a suspected malicious file and the client's upload request is rejected; if it does not, the request is considered free of malicious behavior and the upload is allowed.

In the client environment, a real-time monitoring browser plug-in or another real-time monitoring tool monitors the data of web page pictures (specifically PNG images) browsed by the user in real time, and the PNG image feature recognition library is used for image feature recognition. If an abnormal image is found (i.e., machine recognition shows that it is a PNG image that does not meet the standard), the user is forbidden to access the image resource. If no abnormality is found, the digital steganography recognition model is used to detect whether the image has information hiding; if it does, the user is forbidden to access the image resource, and if not, the user can access the picture resource normally.

As shown in fig. 1, the method specifically comprises the following steps:

step one, a PNG image feature library and a digital steganography recognition model are established through machine learning.

For the establishment of the PNG image feature library, only shallow learning is adopted in view of the uniformity of the PNG image format. First, a batch of PNG images is provided as training set data and imported into the machine learning system. Second, a PNG image feature recognition library is established, comprising the following feature information: (1) the PNG file header feature; (2) the PNG end flag, i.e. the IEND block; (3) the IHDR block recording PNG image information; (4) the IDAT block storing the actual image data; (5) blocks storing redundant image information (e.g., the tEXt block), and so on. Finally, feature learning is performed on this manually designed recognition library; since the goal of learning is to recognize and classify the target, a Support Vector Machine (SVM) is selected for supervised learning.
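The structural features listed above can be read directly from the PNG byte stream. The following Python sketch, which uses only the standard library, extracts them into a dictionary that could serve as input to a classifier such as an SVM. The function names `make_chunk` and `png_features`, the dictionary keys, and the `trailing_bytes` feature are illustrative assumptions and are not specified by the patent.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Build one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data)
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def png_features(blob):
    """Collect header, chunk-order, CRC and trailing-data features."""
    chunks, crc_ok, pos = [], True, 8
    signature_ok = blob[:8] == PNG_SIGNATURE
    while signature_ok and pos + 12 <= len(blob):
        (length,) = struct.unpack(">I", blob[pos:pos + 4])
        ctype = blob[pos + 4:pos + 8]
        end = pos + 12 + length
        if end > len(blob):
            crc_ok = False  # chunk claims more data than the file holds
            break
        data = blob[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", blob[end - 4:end])
        if stored_crc != zlib.crc32(ctype + data):
            crc_ok = False
        chunks.append(ctype.decode("latin-1"))
        pos = end
        if ctype == b"IEND":
            break
    return {
        "signature_ok": signature_ok,
        "ihdr_first": chunks[:1] == ["IHDR"],
        "has_idat": "IDAT" in chunks,
        "ends_with_iend": chunks[-1:] == ["IEND"],
        "text_chunks": sum(c in ("tEXt", "zTXt", "iTXt") for c in chunks),
        "crc_ok": crc_ok,
        # Bytes after IEND are a common place to append a payload.
        "trailing_bytes": len(blob) - pos if signature_ok else len(blob),
    }


# A minimal 1x1 8-bit grayscale PNG built in memory for the demonstration.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
idat = zlib.compress(b"\x00\x00")  # filter byte + one pixel
sample = (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
          + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))
tampered = sample + b"<?php /* payload */ ?>"
print(png_features(sample))
print(png_features(tampered))
```

In this sketch, the tampered file differs from the clean one only in `trailing_bytes`, which shows why features beyond the header signature are worth including in the library.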

For the establishment of the digital steganography recognition model, considering that, besides the classical steganographic algorithms, algorithms that are variants of the classical ones or that are independently designed are difficult to detect, the method combines shallow learning and deep learning:

on one hand, a feature library is established based on the steganographic features of a classical steganographic algorithm for feature learning, where the classical steganographic algorithm refers to a spatial-domain steganographic algorithm such as Least Significant Bit (LSB) steganography. The RS (Regular and Singular groups) analysis algorithm detects secret information based on the change in image smoothness before and after steganography and is very robust against random LSB steganography (i.e., the secret information is embedded in the least significant bits of the image in a random order), so the RS analysis algorithm is selected to perform supervised learning on the PNG image, as follows:

firstly, dividing an image input into the model to be trained into a plurality of image blocks of the same size, scanning each image block in zigzag order into a pixel vector $G = (x_1, x_2, \ldots, x_n)$, and calculating the spatial correlation of each image block using the following formula:

$$f(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n-1} \left| x_{i+1} - x_i \right|$$

wherein $x_i$ represents the gray value of each pixel; the smaller the value of $f$, the smaller the gray value change between adjacent pixels and the stronger the spatial correlation of the image block.

Then a non-negative inversion ($F_1$ and $F_0$) operation is applied to randomly extracted pixels of each image block, wherein the inversion functions are defined as follows:

denote by $F_1$ the function that exchanges the pixel values $2i$ and $2i+1$, i.e. $0 \leftrightarrow 1,\ 2 \leftrightarrow 3,\ \ldots,\ 254 \leftrightarrow 255$;

denote by $F_0$ the identity, which leaves the pixel value unchanged.

The proportion of image blocks whose spatial correlation increases (denoted $R_M$) and the proportion of image blocks whose spatial correlation decreases (denoted $S_M$) are calculated, where

$$R_M + S_M \le 1$$

Likewise, a non-positive inversion ($F_{-1}$ and $F_0$) operation is applied to randomly extracted pixels of each image block, and the proportion of image blocks whose spatial correlation increases (denoted $R_{-M}$) and the proportion of image blocks whose spatial correlation decreases (denoted $S_{-M}$) are calculated, where

$$R_{-M} + S_{-M} \le 1$$

Statistically, if the image has not been subjected to LSB steganography, applying non-negative inversion or non-positive inversion to it destroys the spatial correlation of the image blocks to the same extent, i.e. increases the disorder of the image blocks equally, and in this case $R_M \approx R_{-M}$, $S_M \approx S_{-M}$, and $R_M > S_M$, $R_{-M} > S_{-M}$.

Therefore, if the increase of the degree of disorder caused by applying the non-positive inversion operation to the image is larger than the increase of the degree of disorder caused by applying the non-negative inversion operation, the PNG image is considered to have the LSB steganography very likely, and the label is set to have the LSB steganography characteristic; otherwise, setting the label as the LSB steganography characteristic does not exist, and outputting. Finally, the input object (PNG image) and the expected output (whether LSB steganography characteristics exist) form training data, a Learning mode (Learning mode) is established, and whether LSB steganography exists in the new PNG image or not is presumed according to the Learning mode.
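The labeling procedure above can be sketched in Python as follows. This is a simplified illustration under several assumptions that the patent does not fix: pixels are given as a flat list of gray values rather than zigzag-scanned blocks, a fixed mask `(0, 1, 1, 0)` replaces the random choice of pixels, R counts groups whose $f$ value grows after flipping and S counts groups whose $f$ value shrinks (the usual RS convention), and the `tolerance` parameter and all function names are hypothetical.

```python
def smoothness(group):
    """Discrimination function f: sum of absolute neighbour differences."""
    return sum(abs(b - a) for a, b in zip(group, group[1:]))


def flip(value, mode):
    """Apply F1 (mode=1), F-1 (mode=-1) or the identity F0 (mode=0)."""
    if mode == 1:
        return value ^ 1              # 0<->1, 2<->3, ..., 254<->255
    if mode == -1:
        return ((value + 1) ^ 1) - 1  # -1<->0, 1<->2, ..., 255<->256
    return value


def rs_ratios(pixels, mask):
    """Return (R, S): share of groups whose f grows / shrinks under the mask."""
    n = len(mask)
    groups = [pixels[i:i + n] for i in range(0, len(pixels) - n + 1, n)]
    if not groups:
        raise ValueError("need at least one full group of pixels")
    regular = singular = 0
    for group in groups:
        flipped = [flip(x, m) for x, m in zip(group, mask)]
        before, after = smoothness(group), smoothness(flipped)
        if after > before:
            regular += 1
        elif after < before:
            singular += 1
    return regular / len(groups), singular / len(groups)


def has_lsb_feature(pixels, mask=(0, 1, 1, 0), tolerance=0.05):
    """Label True when non-positive flipping adds clearly more disorder."""
    r_pos, s_pos = rs_ratios(pixels, mask)
    r_neg, s_neg = rs_ratios(pixels, [-m for m in mask])
    return (r_neg - s_neg) - (r_pos - s_pos) > tolerance


# Hypothetical toy data: a flat region, and the same region after its
# least significant bits were overwritten with message bits.
clean = [100] * 8
stego = [100, 101, 100, 101, 101, 100, 101, 100]
print(rs_ratios(clean, (0, 1, 1, 0)), rs_ratios(clean, (0, -1, -1, 0)))
print(rs_ratios(stego, (0, 1, 1, 0)), rs_ratios(stego, (0, -1, -1, 0)))
print(has_lsb_feature(clean), has_lsb_feature(stego))
```

On the flat region both masks increase disorder equally, so no label is set; on the LSB-modified region only the non-positive mask increases disorder, so the label is set. Real images would need many more groups for the ratios to be statistically meaningful.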

On the other hand, based on the characteristic that the quality of an image tends to change slightly after steganography, a high-pass filter is first used to preprocess both PNG images containing steganographic information and PNG images without it, enhancing the image features, and the resulting residual images are used as the training set. Considering that the convolutional neural network model is superior in spatial mapping and well suited to processing images, and that transfer learning helps reduce the amount of data needed to build a neural network when data is insufficient, a Convolutional Neural Network (CNN) model improved from that of Lionel Pibre et al. is selected for transfer learning, and the main idea is as follows:

the convolutional neural network model pre-trained by Lionel Pibre et al. is used as a feature extraction operator, the last layer of the convolutional neural network is replaced with a new classifier, and then the weights of the other layers are fixed and the convolutional neural network is trained.
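The high-pass filtering used to produce the residual images can be illustrated with a small Python sketch. The patent does not name the filter kernel; the 3x3 Laplacian kernel below is an assumption chosen for brevity, and the function name `high_pass_residual` is hypothetical. The filter is applied without padding, so the residual is smaller than the input by the kernel border.

```python
# Assumed kernel: a 3x3 Laplacian, one common high-pass filter.
LAPLACIAN = [[0, -1, 0],
             [-1, 4, -1],
             [0, -1, 0]]


def high_pass_residual(image, kernel=LAPLACIAN):
    """Slide the kernel over a 2-D list of gray values (no padding)."""
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    residual = []
    for r in range(rows - k + 1):
        out_row = []
        for c in range(cols - k + 1):
            acc = 0
            for dr in range(k):
                for dc in range(k):
                    acc += kernel[dr][dc] * image[r + dr][c + dc]
            out_row.append(acc)
        residual.append(out_row)
    return residual


# A smooth ramp (image content) and the same ramp with one pixel changed by 1,
# as a single LSB modification would do.
ramp = [[10 * r + 5 * c for c in range(5)] for r in range(5)]
modified = [row[:] for row in ramp]
modified[2][2] += 1
print(high_pass_residual(ramp))
print(high_pass_residual(modified))
```

The smooth content is cancelled entirely, while the one-level pixel change is amplified to a residual of 4 at its position. This illustrates why the residual, rather than the raw image, is a more suitable training input for detecting slight embedding changes.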

Referring to fig. 2, the convolutional neural network model structure is as follows:

Input: all pixel values of the processed residual image;

Feature construction layer: the pre-trained model is used as a feature extractor;

Classifier: comprises a fully connected layer (Fully Connected Layer) and a classification function (softmax);

Output: the probability that digital steganography exists in the image; when the output probability is greater than 0.8, the image is considered to contain digital steganography.
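The final decision stage, a softmax over the classifier scores followed by the 0.8 threshold, can be sketched as below. The two-element score layout `[score_normal, score_stego]` and the function names are assumptions made for the illustration; only the softmax function and the 0.8 threshold come from the text.

```python
import math

STEGO_THRESHOLD = 0.8  # threshold stated in the description


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores):
    """scores = [score_normal, score_stego] (assumed ordering)."""
    p_stego = softmax(scores)[1]
    return p_stego, p_stego > STEGO_THRESHOLD


for scores in ([0.0, 0.0], [0.0, 1.0], [0.0, 2.0]):
    probability, flagged = classify(scores)
    print(scores, round(probability, 3), flagged)
```

With these hypothetical scores only the last case exceeds the threshold, so a moderately suspicious image (probability about 0.73) would still be treated as having no steganography under the stated rule.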

The classifier is constructed by using an Image Quality Metrics (IQM) based blind detection method proposed by Avcibas, and specifically comprises the following steps:

1. feature vectors are selected by defining various measures of image quality, where the Analysis of Variance (ANOVA) technique is used in order to extract more distinctive features; taking the Minkowski measure as an example, the norm of the dissimilarity of two images can be represented by the Minkowski average of the pixel differences taken first spatially and then over chromaticity (i.e., over all bands):

$$M_\gamma = \frac{1}{K}\sum_{k=1}^{K}\left\{\frac{1}{N}\sum_{i,j}\left|C_k(i,j)-\hat{C}_k(i,j)\right|^{\gamma}\right\}^{1/\gamma}$$

where, when $\gamma = 1$, $M_\gamma$ denotes the mean absolute error, and when $\gamma = 2$, $M_\gamma$ denotes the mean square error; $C_k(i,j)$ represents the multispectral component of the normal image at pixel location $(i,j)$ in band $k$, $\hat{C}_k(i,j)$ represents the multispectral component of the steganographic image at pixel location $(i,j)$ in band $k$, $K$ is the number of bands, and $N$ is the total number of image pixels;

2. the selected IQM (Image Quality Metrics) forms a multi-dimensional feature space in which normal images are more distinguishable from stego images;

3. after a proper feature set is selected, a multiple linear regression model is established on a large amount of experimental data, and a classifier for distinguishing normal images from steganographic images is established on the basis of the regression model.
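As a concrete instance of one image quality metric from step 1, the Minkowski measure can be computed as in the Python sketch below. The sketch assumes the form described in the text, a per-band average of $|C_k - \hat{C}_k|^\gamma$ over all pixels, raised to $1/\gamma$ and then averaged over the bands; the function name, the data layout (one flat list of pixel values per band), and the sample values are hypothetical.

```python
def minkowski_metric(normal, stego, gamma):
    """Average over bands of the gamma-norm of per-pixel differences."""
    if len(normal) != len(stego) or not normal:
        raise ValueError("images must have the same, non-zero number of bands")
    band_terms = []
    for band_n, band_s in zip(normal, stego):
        if len(band_n) != len(band_s) or not band_n:
            raise ValueError("bands must have the same, non-zero size")
        mean_power = sum(abs(a - b) ** gamma
                         for a, b in zip(band_n, band_s)) / len(band_n)
        band_terms.append(mean_power ** (1.0 / gamma))
    return sum(band_terms) / len(band_terms)


# Hypothetical single-band image with four pixels; two pixels differ by 1.
normal_image = [[10, 20, 30, 40]]
stego_image = [[11, 20, 29, 40]]
print(minkowski_metric(normal_image, stego_image, 1))
print(minkowski_metric(normal_image, stego_image, 2))
```

Several such metrics evaluated on the same image pair would form the multi-dimensional feature vector on which the regression-based classifier of step 3 is fitted.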

Step two, checking all requests for uploading picture files at the server, firstly carrying out decoding pretreatment on data, then carrying out feature matching identification on the PNG picture by comparing with the PNG picture feature library established in the step one, and if an illegal PNG picture format is found, rejecting the uploading request; otherwise, the PNG picture is subjected to primary identification, and the step three is carried out.

In this step, the request for uploading the picture file is examined, and the examination information includes the following: (1) the file suffix name; (2) the Content-Type declared in the header of the HTTP message; (3) whether the transmitted content is encoded; (4) whether the transmitted content is legitimate.
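The four examination items can be sketched as a single Python function. This is an illustration under stated assumptions, not the patented implementation: the function name `examine_upload` and its parameters are hypothetical, base64 is assumed as the only transfer encoding, and the legitimacy check is reduced to verifying the 8-byte PNG signature (a full check would match the content against the PNG image feature library).

```python
import base64
import binascii
import os

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def examine_upload(filename, content_type, body, transfer_encoding=None):
    """Return (accepted, reasons) for one upload request."""
    reasons = []
    # (1) file suffix name
    if os.path.splitext(filename)[1].lower() != ".png":
        reasons.append("suffix")
    # (2) declared Content-Type, ignoring any parameters after ';'
    if content_type.split(";")[0].strip().lower() != "image/png":
        reasons.append("content-type")
    # (3) decode the body first if the request says it is encoded
    if (transfer_encoding or "").lower() == "base64":
        try:
            body = base64.b64decode(body, validate=True)
        except (binascii.Error, ValueError):
            reasons.append("encoding")
            body = b""
    # (4) minimal legitimacy check on the decoded content
    if not body.startswith(PNG_SIGNATURE):
        reasons.append("content")
    return (not reasons), reasons


png_bytes = PNG_SIGNATURE + b"...chunk data..."
print(examine_upload("logo.png", "image/png", png_bytes))
print(examine_upload("shell.php", "image/png", png_bytes))
print(examine_upload("logo.png", "image/png", b"<?php eval($_POST['x']); ?>"))
print(examine_upload("logo.png", "image/png",
                     base64.b64encode(png_bytes), "base64"))
```

Checking the decoded content rather than trusting the suffix and the declared type is what catches the third request, where a script is disguised with a `.png` name and an `image/png` header.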

Step three, for the PNG picture format file passing the primary recognition, calling the digital steganography recognition model established in the step one, mining whether the PNG picture has information hiding, and if so, rejecting the uploading request; if not, the upload request is allowed.
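Steps two and three together form a two-stage decision, which the following Python sketch makes explicit. The function `screen_upload` and its callback parameters are hypothetical; the two callbacks stand in for the PNG image feature library and the digital steganography recognition model, and the 0.8 threshold is the one given for the model output.

```python
def screen_upload(blob, format_is_legal, stego_probability, threshold=0.8):
    """Server-side decision: format check first, steganalysis second."""
    if not format_is_legal(blob):
        return "reject: illegal PNG format"
    if stego_probability(blob) > threshold:
        return "reject: information hiding suspected"
    return "allow"


# Stand-in callbacks for the demonstration only.
def looks_like_png(blob):
    return blob.startswith(b"\x89PNG\r\n\x1a\n")


def fake_model(blob):
    # Pretend the model flags any file that carries the marker b"hidden".
    return 0.95 if b"hidden" in blob else 0.10


header = b"\x89PNG\r\n\x1a\n"
print(screen_upload(b"GIF89a", looks_like_png, fake_model))
print(screen_upload(header + b"hidden payload", looks_like_png, fake_model))
print(screen_upload(header + b"plain image", looks_like_png, fake_model))
```

Running the cheaper format check first means the steganography model is only invoked for files that already pass as PNG images.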

Step four, at the client, monitoring PNG picture format file data during web page transmission by means such as a real-time monitoring browser plug-in, preprocessing the data (e.g., decoding), performing feature matching identification on the PNG picture against the PNG picture feature library established in step one, and forbidding access to the picture resource if an illegal PNG picture format is found; otherwise, go to step five.

When the client monitors web page PNG image data, it specifically monitors whether information hiding exists in the PNG image data; the case of pictures that implicitly induce users to follow malicious links is not considered.

Step five, as in step three, calling the digital steganography recognition model established in step one to mine whether the PNG picture has information hiding; a picture with information hiding is considered to possibly hide malicious information, and access to the picture resource is forbidden.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A malicious PNG image identification method based on machine learning is characterized by comprising the following steps:

step five, calling the digital steganography recognition model established in the step one, mining whether the PNG picture has information hiding, regarding the picture with the information hiding, considering that malicious information is possibly hidden, and forbidding to access the picture resource;

in the second step, the request for uploading the picture file is examined, and the examination information comprises the following information: (1) the file suffix name; (2) the Content-Type declared in the header of the HTTP message; (3) whether the transmitted content is encoded; (4) whether the transmitted content is legitimate.

2. The method for identifying malicious PNG based on machine learning according to claim 1, wherein the PNG image feature library is created in step one by the following process: firstly, providing batch PNG images as training set data to be imported into a machine learning system; secondly, a PNG image feature recognition library is established, and the PNG image feature recognition library comprises the following feature information: (1) PNG header feature; (2) PNG end flag IEND block; (3) an IHDR block recording PNG image information; (4) an IDAT block storing actual image data; (5) storing the image redundancy information block; and finally, selecting a support vector machine model for feature learning aiming at the recognition library to complete the recognition and classification of the target.

3. The malicious PNG image recognition method based on machine learning according to claim 1, wherein the digital steganography recognition model of the first step is established by combining shallow learning and deep learning: on one hand, a feature library is established based on the steganographic features of a classical steganographic algorithm for feature learning; on the other hand, based on the characteristic that the quality of the image after steganography is liable to have slight change, filtering pretreatment is respectively carried out on the PNG image containing steganography information and the PNG image without steganography information by using a high-pass filter, the image display characteristic is enhanced, the obtained residual image is used as a training set, then a convolutional neural network model is selected for transfer learning, and finally the probability that the digital steganography exists in the image is output;

the structure of the convolutional neural network model comprises:

input: all pixel values of the processed residual image;

a classifier: comprising a fully connected layer and a classification function;

output: the probability that digital steganography exists in the image; when the output probability is greater than 0.8, the image is considered to contain digital steganography;

the classifier is constructed using a blind detection method based on image quality metrics:

selecting a feature vector by defining a plurality of measures of image quality using an analysis of variance technique; the norm of the dissimilarity of the two images is represented by the Minkowsky average of the pixel differences taken spatially and then expressed in chroma:

the selected image quality metrics form a multi-dimensional feature space;

after a proper characteristic set is selected, a multiple linear regression model is established on a large amount of experimental data, and a classifier for distinguishing normal images from steganographic images is established on the basis of the regression model.

4. The machine learning-based malicious PNG image recognition method according to claim 3, wherein establishing the feature library for feature learning based on the steganographic features of the classical steganographic algorithm comprises selecting the RS analysis algorithm for supervised learning of the PNG image:

firstly, dividing an image input into a model to be trained into a plurality of image blocks with the same size, scanning and arranging each image block into a pixel vector $G = (x_1, x_2, \ldots, x_n)$, and calculating the spatial correlation of each image block using the following formula:

$$f(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n-1} \left| x_{i+1} - x_i \right|$$

denote by $F_1$ the function that exchanges the pixel values $2i$ and $2i+1$, i.e. $0 \leftrightarrow 1,\ 2 \leftrightarrow 3,\ \ldots,\ 254 \leftrightarrow 255$;

denote by $F_0$ the identity, which leaves the pixel value unchanged;

If the increase of the chaos degree caused by applying the non-positive flip operation to the image is larger than the increase of the chaos degree caused by applying the non-negative flip operation, setting a label as having LSB steganography characteristics to the PNG image; otherwise, setting the label as having no LSB steganography characteristic, and outputting;