CN102346830B - Gradient histogram-based virus detection method - Google Patents

Gradient histogram-based virus detection method

Publication number
CN102346830B
Authority
CN
China
Prior art keywords
virus
feature vector
histogram
carried out
target virus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201110285716.7A
Other languages
Chinese (zh)
Other versions
CN102346830A (en)
Inventor
唐朝伟
严鸣
蒋阳
张雪臻
时豪
李超群
杨磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN201110285716.7A
Publication of CN102346830A
Application granted
Publication of CN102346830B
Expired - Fee Related (current legal status)
Anticipated expiration

Abstract

The invention discloses a gradient histogram-based virus detection method. To meet the growing demands for real-time performance and accuracy in virus detection, the method works on the binary representation of a virus file. Unlike traditional virus detection methods, it operates on the virus file as a whole and improves the efficiency of malicious code detection through the following steps: first, the virus file is preprocessed and represented as a numerical matrix suitable for processing; next, feature vectors describing the virus file are obtained with HOG (Histogram of Oriented Gradients) descriptors, which reflect edge characteristics; finally, detection is performed by classifying the feature vectors.

Description

Gradient histogram-based virus detection method
Technical field
The present invention relates to a virus detection method, and in particular to a virus detection method based on the histogram of gradients.
Background
The Internet has changed the way people live and work, reshaped the global economic and social structure, and had a far-reaching influence on every aspect of human society. However, as the Internet has developed at high speed, network security problems have also become increasingly serious.
As the Internet has become widespread, the harm caused by viruses has grown increasingly serious. Viruses not only inflict huge economic losses on enterprises and users, but also pose a serious threat to national security. As virus attacks and damage become more prevalent and more varied, information system security faces formidable challenges, and viruses have become a key factor affecting socio-economic development and national development strategy. Therefore, analyzing the basic principles of computer viruses and the corresponding prevention techniques, and strengthening the security and reliability of computer systems, remain important topics in the field of computer applications.
Virus detection is another important security technology that follows conventional safeguards such as firewalls and data encryption in a security defense system, and it plays an important role in protecting information systems. Virus detection technology has made great progress in two directions: intelligence and distribution. In recent years, techniques such as data mining, artificial immunity, information retrieval, and fault tolerance have also been incorporated into virus detection systems, pushing virus detection to a new level. As the Internet develops, virus detection systems face ever-increasing challenges, and improving detection efficiency and accuracy has become an urgent task.
Summary of the invention
The object of the present invention is to provide a virus detection method that improves detection efficiency.
To achieve this object, the invention provides a virus detection method based on the histogram of gradients, comprising the following steps:
a step of extracting the feature vector of the target virus using a HOG (Histogram of Oriented Gradients) descriptor; a step of performing SVM classification on viruses; and a step of identifying the virus feature vector.
The step of extracting the feature vector of the target virus using the HOG descriptor comprises the following steps:
A1. Preprocess the target virus: treat every eight binary digits of the target virus as one group, convert each group from binary to decimal, and represent the resulting decimal codes as a numerical matrix;
A2. Calculate the gradient magnitude of the target virus;
A3. Extract the feature vector of the target virus;
A4. Perform dimensionality reduction on the extracted feature vector.
The step of performing SVM classification on viruses comprises the following steps:
B1. Obtain samples, where the samples consist of the set of feature vectors of virus files and the set of feature vectors of normal files;
B2. Normalize the data of the obtained samples;
B3. Perform the classification calculation on the obtained samples.
The step of identifying the virus feature vector comprises the following step:
C1. Compare the feature vector of the target virus with the feature vectors of the SVM classification.
In summary, by adopting the above technical solution, the invention has the following beneficial effects: the virus file is preprocessed so that the target virus file is represented as a numerical matrix suitable for processing; the HOG descriptor is then used to obtain the feature vector of the target virus file; finally, the virus is detected by comparing the feature vector of the target virus with the feature vectors of the SVM classification, thereby improving the efficiency of malicious code detection.
Brief description of the drawings
The invention is described by way of example with reference to the accompanying drawings, in which:
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 shows the rectangular histogram-of-gradients descriptor;
Fig. 3 is a schematic diagram of a region containing 2*2 cells;
Fig. 4 is an example of a histogram of oriented gradients;
Fig. 5 is a schematic diagram of support vectors;
Fig. 6 compares the performance of the SVM classifier and the KNN classifier;
Fig. 7 shows the ROC curves of the SVM classifier and the KNN classifier.
Embodiment
To make the object, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and an embodiment. It should be understood that the embodiment described here only explains the invention and is not intended to limit it.
As shown in Fig. 1, a virus detection method based on the histogram of gradients comprises the following steps: a step of extracting the feature vector of the target virus using a HOG descriptor; a step of performing SVM classification on viruses; and a step of identifying the virus feature vector.
The step of extracting the feature vector of the target virus using the HOG descriptor comprises the following steps:
A1. Preprocess the target virus: treat every eight binary digits of the target virus as one group, convert each group from binary to decimal, and represent the resulting decimal codes as a numerical matrix;
A2. Calculate the gradient magnitude of the target virus;
A3. Extract the feature vector of the target virus;
A4. Perform dimensionality reduction on the extracted feature vector.
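Step A1 can be illustrated with a minimal sketch. It assumes that each byte of the file forms one eight-bit group, that the decimal values are laid out row by row, and that an incomplete last row is padded with zeros. The function name `bytes_to_matrix`, the `width` parameter, and the padding choice are illustrative assumptions, since the text does not state how the matrix width is chosen.

```python
def bytes_to_matrix(data: bytes, width: int) -> list[list[int]]:
    """Turn a file's bytes into a numerical matrix with `width` columns.

    Each byte (eight binary digits) becomes one decimal value in 0..255.
    Zero-padding the last row is an assumption made for this sketch.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    values = list(data)  # iterating over bytes yields integers 0..255
    remainder = len(values) % width
    if remainder:
        values.extend([0] * (width - remainder))
    return [values[i:i + width] for i in range(0, len(values), width)]


# Three bytes written in binary, arranged into rows of two values.
sample = bytes([0b00000001, 0b11111111, 0b00010000])
matrix = bytes_to_matrix(sample, width=2)
```

In practice the bytes would come from reading the file under test in binary mode, and the width would be fixed so that every file maps to a matrix of the size expected by the later steps.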
The step of performing SVM classification on viruses comprises the following steps:
B1. Obtain samples, where the samples consist of the set of feature vectors of virus files and the set of feature vectors of normal files;
B2. Normalize the data of the obtained samples;
B3. Perform the classification calculation on the obtained samples.
The step of identifying the virus feature vector comprises the following step:
C1. Compare the feature vector of the target file with the feature vectors of the SVM classification.
As shown in Figs. 2 to 4, when the HOG descriptor is computed, the scan matrix obtained in step A1 is divided by a sliding window into a dense, uniform grid of points that defines a number of regions; the number of regions is determined by the matrix. Each region contains $\varsigma\eta \times \varsigma\eta$ data points and is divided into cells centered on each grid point. Each region (block) contains $\varsigma \times \varsigma$ cells, each cell contains $\eta \times \eta$ data points, and each cell has $\beta$ orientation bins, where $\varsigma$, $\eta$, and $\beta$ are parameters. Fig. 3 is a schematic diagram of a region in the rectangular histogram of gradients used in the invention; in this embodiment, each region contains 2*2 cells, and each cell contains 8*8 data points. Because the translation step used to generate the grid points equals the number of data points along one side of a cell, the descriptor regions usually overlap, and most cells contribute to the HOG descriptors of several regions at once. The HOG descriptor of each region therefore needs to be normalized.
Taking a matrix of size 128*64 and a region size of 16*16 as an example, the number of dense, uniform grid center points used to compute the regions and cells is calculated as follows:
$N = \frac{S - B}{8} + 1$,
where $S$ is the size of the matrix and $B$ is the size of the region along the same direction. Substituting the matrix size and the region size gives (64-16)/8+1=7 points in the vertical direction and (128-16)/8+1=15 points in the horizontal direction, so the window of the whole scan matrix is divided into 7*15=105 grid center points. The histogram of gradients of each cell is computed around these grid points, and the cell histograms are then combined into the histogram of gradients of the corresponding local region.
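The arithmetic above can be checked with a short sketch. The function name `grid_point_count` and the `stride` parameter are illustrative; the stride of 8 is taken from the worked example, where it equals the cell side length.

```python
def grid_point_count(matrix_size: int, region_size: int, stride: int = 8) -> int:
    """Number of grid center points along one direction of the scan matrix."""
    if region_size > matrix_size:
        raise ValueError("region must fit inside the matrix")
    return (matrix_size - region_size) // stride + 1


vertical = grid_point_count(64, 16)     # (64 - 16) / 8 + 1
horizontal = grid_point_count(128, 16)  # (128 - 16) / 8 + 1
total = vertical * horizontal           # grid center points in the whole window
```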
Let the function $f(x, y)$ represent the virus matrix, where $x$ and $y$ are the row number and column number respectively, and denote the operator function by $h(k, l)$. The following expression then holds:
$g(x, y) = \sum_{k=-1}^{1} \sum_{l=-1}^{1} h(k, l)\, f(x + k, y + l)$,
where $k$ and $l$ take the values 1, 0, and -1. Taking a 5*5 matrix as an example, the above formula can be written out as:
$g(x, y) = h(-1,-1) f(x-1, y-1) + h(-1,0) f(x-1, y) + \cdots + h(1,1) f(x+1, y+1)$.
From this formula, the operator function $h$ can be expressed in the following form:
$h = \begin{bmatrix} h(-1,-1) & h(-1,0) & h(-1,1) \\ h(0,-1) & h(0,0) & h(0,1) \\ h(1,-1) & h(1,0) & h(1,1) \end{bmatrix}$.
During the computation, the operator is placed on the matrix so that its center $h(0, 0)$ lies over the data point being processed; each $h$ value in the grid is then multiplied by the corresponding data point in the matrix. The matrix gradient can be computed with a one-dimensional template, a 3*3 Sobel template, or a 2*2 diagonal template. The template used in the invention is [-1, 0, 1], which is convolved along the X axis and the Y axis separately; the gradient is then computed to obtain the magnitude and direction of the gradient at each data point. The corresponding expressions for the gradient magnitude and direction are:
$G(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2}$,
$\theta(x, y) = \arctan\frac{G_y(x, y)}{G_x(x, y)}$,
where $G_x$ and $G_y$ are the results of the convolution along the X axis and the Y axis. Because the matrix and the kernel operator are of finite size, rows and columns must be added around the matrix when its edge points are processed with the kernel operator; the added values are either 0 or equal to the neighboring boundary values. The invention pads with 0.
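The gradient step can be sketched as follows, assuming the one-dimensional template is [-1, 0, 1] applied as a centered difference along each axis, with zeros supplied outside the matrix. The function name `gradients`, the helper `at`, and the folding of angles into the 0° to 180° range are choices made for this illustration rather than details confirmed by the text.

```python
import math


def gradients(matrix):
    """Return (magnitude, angle) matrices for a list-of-lists input.

    Uses the [-1, 0, 1] template along each axis with zero padding.
    Angles are unsigned and expressed in degrees in [0, 180).
    """
    rows, cols = len(matrix), len(matrix[0])

    def at(r, c):
        # Zero padding: positions outside the matrix read as 0.
        return matrix[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    magnitude = [[0.0] * cols for _ in range(rows)]
    angle = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            gx = at(r, c + 1) - at(r, c - 1)  # template along the column index
            gy = at(r + 1, c) - at(r - 1, c)  # template along the row index
            magnitude[r][c] = math.hypot(gx, gy)
            angle[r][c] = math.degrees(math.atan2(gy, gx)) % 180.0
    return magnitude, angle


mag, ang = gradients([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

At the center of this 3*3 example the differences are 6-4=2 and 8-2=6, so the magnitude is the square root of 40. At the corners the zero padding takes effect, which is why border values tend to show large gradients.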
After the gradient has been computed and before the final histogram is formed, the gradient of each cell in a region is filtered with a Gaussian window. A Gaussian filter is a linear smoothing filter whose weights follow the shape of the Gaussian function. Within a cell, the Gaussian assigns weights according to spatial distance: the weight at the center is the largest, and the farther a point is from the center, the smaller its weight. The weighted filtering follows this principle.
The one-dimensional Gaussian function with standard deviation $\sigma$ is defined as:
$w(d) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{d^2}{2\sigma^2}}$,
where $d$ is the distance from a data point to the grid center of the region. After Gaussian filtering has assigned weights to the data points of the matrix gradient, a histogram is used in each cell to count the gradient directions of its data points. The histogram of gradients covers 0° to 180° with one bin per 20°, giving 9 bins in total, so each region yields a feature vector of 4*9=36 dimensions. For example, a histogram of oriented gradients is built for each of the 4 cells in a block: the horizontal axis of the histogram represents the gradient angle, divided equally from 0° to 180° into 9 parts, each 20° wide (the bin width), and the vertical axis represents the accumulated gradient magnitude.
Each data point in a cell votes into the histogram according to its own Gaussian weight: the bin is chosen by the data point's gradient angle, and the product of the gradient magnitude and the weight is added to that bin as its statistical value. For example, if the gradient angle of a data point lies between the lower and upper bounds of a given bin, the data point belongs to that bin, and its gradient magnitude is added to the statistical value of that bin.
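The weighting and voting described above can be sketched as follows. The sketch assumes 9 bins of 20° over 0° to 180° and adds the product of gradient magnitude and Gaussian weight to a single bin, without interpolating between neighboring bins. The function names, the flat-list inputs, and the sample values (including the 25° angle) are illustrative assumptions.

```python
import math


def gaussian_weight(distance: float, sigma: float) -> float:
    """One-dimensional Gaussian evaluated at a distance from the center."""
    return math.exp(-distance ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def cell_histogram(magnitudes, angles, weights, bins: int = 9):
    """Accumulate weighted gradient magnitudes into orientation bins."""
    bin_width = 180.0 / bins
    histogram = [0.0] * bins
    for magnitude, angle, weight in zip(magnitudes, angles, weights):
        # min() guards against an angle that lands exactly on the upper edge.
        index = min(int((angle % 180.0) // bin_width), bins - 1)
        histogram[index] += magnitude * weight
    return histogram


# Three data points: angles 10, 25 and 179 degrees fall in bins 0, 1 and 8.
hist = cell_histogram([2.0, 3.0, 1.0], [10.0, 25.0, 179.0], [1.0, 0.5, 1.0])
```

With these inputs the 25° data point falls in the 20° to 40° bin and contributes 3.0 * 0.5 = 1.5 to it.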
Because most regions (blocks) overlap when a matrix is processed, the feature vector describing each region's histogram of gradients must be normalized once it has been obtained through the above operations. The invention normalizes the region histogram vectors with L2-Hys, calculated as follows:
$v \rightarrow \frac{v}{\sqrt{\lVert v \rVert_2^2 + \varepsilon^2}}$,
where $v$ is the unnormalized HOG descriptor vector of a region and $\varepsilon$ is a small constant. Finally, the HOG descriptors of all regions in the matrix under test are combined to generate the histogram-of-gradients feature vector that describes the matrix under test.
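A minimal sketch of L2-Hys normalization follows. As the scheme is usually defined, the vector is L2-normalized, its components are clipped at an upper limit, and the result is normalized again. The clipping limit of 0.2 and the value of `eps` are commonly used defaults assumed here; the text does not state them.

```python
import math


def l2_hys(vector, clip: float = 0.2, eps: float = 1e-6):
    """L2-normalize, clip each component at `clip`, then L2-normalize again."""

    def l2_normalize(values):
        norm = math.sqrt(sum(x * x for x in values) + eps * eps)
        return [x / norm for x in values]

    normalized = l2_normalize(vector)
    clipped = [min(x, clip) for x in normalized]
    return l2_normalize(clipped)


# [3, 4, 0] normalizes to about [0.6, 0.8, 0]; both entries are clipped to 0.2,
# so the final vector has two equal non-zero components.
block_vector = l2_hys([3.0, 4.0, 0.0])
```

The clipping step limits the influence of a few very large gradient values within a region.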
After the HOG feature vector of a virus file has been extracted, its dimensionality is reduced to achieve faster detection. The invention uses PCA (principal component analysis) to reduce the dimension of the feature vector.
Suppose there are $p$ original indicators to be analyzed, denoted $x_1, x_2, \ldots, x_p$, and $n$ samples in total, so that the sample matrix can be written as $X = (x_{ij})$ ($i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, p$). First, the correlation matrix $R = (r_{ij})$ of the standardized sample matrix is calculated (where $r_{ij}$ is the correlation coefficient between indicators $x_i$ and $x_j$, $i, j = 1, 2, \ldots, p$). The characteristic equation $|\lambda I - R| = 0$ is then solved to obtain the non-negative eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$, arranged in descending order. The variance of each principal component $F_i$ equals the corresponding eigenvalue $\lambda_i$, which is also called the variance contribution of $F_i$. The sum of the variance contribution rates of the first $m$ principal components $F_1, F_2, \ldots, F_m$ is
$\sum_{i=1}^{m} \lambda_i \Big/ \sum_{i=1}^{p} \lambda_i$,
which is the cumulative contribution rate.
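The cumulative contribution rate can be used to decide how many principal components to keep. The sketch below assumes the eigenvalues of the correlation matrix have already been computed by some other means; the function name, the threshold parameter, and the sample eigenvalues are illustrative.

```python
def components_for_contribution(eigenvalues, threshold: float = 0.95):
    """Return (m, rate): the smallest number of leading principal components
    whose cumulative contribution rate reaches `threshold`, and that rate."""
    ordered = sorted(eigenvalues, reverse=True)  # descending order
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for m, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return m, running / total
    return len(ordered), 1.0


# Eigenvalues 4, 3, 2, 1 give cumulative rates 0.4, 0.7, 0.9, 1.0.
m, rate = components_for_contribution([1.0, 4.0, 2.0, 3.0], threshold=0.85)
```

With a threshold of 0.85 the first three components are kept, since two components reach only 0.7.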
As shown in Fig. 5, SVM classification of viruses first requires samples. In the invention, the samples consist of the set of feature vectors of virus files and the set of feature vectors of normal files. The data are then normalized: each attribute value of a feature vector is scaled to the range [-1, 1] or [0, 1]; the invention scales each attribute value to [0, 1]. Kernel functions commonly used for SVM classification include the linear kernel
$K(x_i, x_j) = x_i^T x_j$,
the polynomial kernel
$K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$,
the radial basis function (RBF) kernel
$K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$,
and the sigmoid kernel
$K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$
(where $\gamma$, $r$, and $d$ are kernel parameters). The invention selects the RBF kernel as the kernel function of the SVM classifier. $K(x_i, x_j)$ represents the inner product in the high-dimensional feature space, and the support vectors can be calculated from this inner product. The formula for calculating the support vectors is
$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$,
$\text{s.t. } \sum_{i=1}^{n} \alpha_i y_i = 0$,
$0 \le \alpha_i \le C, \; i = 1, 2, \ldots, n$,
where $y_i \in \{-1, +1\}$ is the class label of sample $x_i$, $C$ is the penalty parameter, and $\alpha_i^*$ is the optimal solution of the formula. When $\alpha_i^* > 0$, $x_i$ is a support vector. The distance between the support vectors is the maximum classification margin, and the plane in the middle of this margin serves as the classification hyperplane.
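The data normalization and the RBF kernel can be sketched in a few lines. The sketch assumes column-wise min-max scaling to [0, 1] and the kernel form exp(-gamma * ||a - b||^2); the function names and sample values are illustrative, and mapping a constant column to 0 is an assumption made to avoid division by zero.

```python
import math


def min_max_scale(samples):
    """Scale every attribute (column) of the sample matrix to [0, 1]."""
    columns = list(zip(*samples))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in samples
    ]


def rbf_kernel(a, b, gamma: float) -> float:
    """Radial basis function kernel between two feature vectors."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


scaled = min_max_scale([[1.0, 10.0], [3.0, 10.0], [2.0, 10.0]])
similarity = rbf_kernel(scaled[0], scaled[1], gamma=0.5)
```

Scaling before applying the kernel keeps attributes with large numeric ranges from dominating the squared distance.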
The dataset in this embodiment contains 202 samples in total: 105 normal programs and 97 viruses. All malicious code comes from the website http://vx.netlux.org, and the normal code was obtained from a computer running Windows XP. SVM is used as the classification algorithm, with the MATLAB version of the LIBSVM software package and its auxiliary functions, and the experiments use five-fold cross-validation.
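Five-fold cross-validation on 202 samples can be sketched as an index split. This is a generic illustration, not the LIBSVM routine used in the experiment; the function name, the shuffling, and the seed are assumptions.

```python
import random


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 202 samples split into five folds of 41, 41, 40, 40 and 40 samples.
splits = list(k_fold_indices(202, k=5))
```

Each sample is used for testing exactly once and for training in the other four rounds.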
Figs. 6 and 7 show the detection results obtained with SVM and KNN as the classifier when a cumulative contribution rate of 95% is used for PCA. For clarity, the classification results are plotted as ROC curves; a larger area under the curve indicates a more stable classifier. The experimental results show that SVM performs better than KNN as the classifier.
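The area under an ROC curve can be computed directly from classifier scores. The sketch below uses the pairwise-comparison definition: the fraction of (virus, normal) pairs in which the virus sample receives the higher score, with ties counted as one half. The label convention (1 for virus, 0 for normal), the function name, and the sample scores are illustrative and are not the experiment's data.

```python
def roc_auc(labels, scores) -> float:
    """Area under the ROC curve from binary labels and real-valued scores."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Three of the four (virus, normal) pairs are ranked correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```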
The present invention is not limited to the foregoing embodiment. It extends to any new feature or any new combination of features disclosed in this specification, and to any new method or process step, or any new combination thereof, that is disclosed.

Claims (4)

1. A virus detection method based on the histogram of gradients, characterized by comprising the following steps:
a step of extracting the feature vector of the target virus using a HOG descriptor; a step of performing SVM classification on viruses; and a step of identifying the virus feature vector; wherein the step of extracting the feature vector of the target virus using the HOG descriptor comprises the following steps:
A1. Preprocess the target virus: treat every eight binary digits of the target virus as one group, convert each group from binary to decimal, and represent the resulting decimal codes as a numerical matrix;
A2. Calculate the gradient magnitude of the target virus;
A3. Extract the feature vector of the target virus;
A4. Perform dimensionality reduction on the extracted feature vector.
2. The virus detection method based on the histogram of gradients according to claim 1, characterized in that the step of performing SVM classification on viruses comprises the following steps:
B1. Obtain samples;
B2. Normalize the data of the obtained samples;
B3. Perform the classification calculation on the obtained samples.
3. The virus detection method based on the histogram of gradients according to claim 2, characterized in that the samples consist of the set of feature vectors of virus files and the set of feature vectors of normal files.
4. The virus detection method based on the histogram of gradients according to claim 1, characterized in that the step of identifying the virus feature vector comprises the following step:
C1. Compare the feature vector of the target file with the feature vectors of the SVM classification.
CN201110285716.7A 2011-09-23 2011-09-23 Gradient histogram-based virus detection method Expired - Fee Related CN102346830B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110285716.7A CN102346830B (en) 2011-09-23 2011-09-23 Gradient histogram-based virus detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110285716.7A CN102346830B (en) 2011-09-23 2011-09-23 Gradient histogram-based virus detection method

Publications (2)

Publication Number Publication Date
CN102346830A CN102346830A (en) 2012-02-08
CN102346830B true CN102346830B (en) 2014-03-05

Family

ID=45545499

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110285716.7A Expired - Fee Related CN102346830B (en) 2011-09-23 2011-09-23 Gradient histogram-based virus detection method

Country Status (1)

Country Link
CN (1) CN102346830B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104077527B (en) * 2014-06-20 2017-12-19 珠海市君天电子科技有限公司 The generation method and device and method for detecting virus and device of Viral diagnosis machine
CN112183433B (en) * 2020-10-12 2024-02-23 水木未来(北京)科技有限公司 Characterization and quantification method for solid and hollow virus particles
CN113449304B (en) * 2021-07-06 2024-03-22 北京科技大学 Malicious software detection method and device based on strategy gradient dimension reduction
CN114647849B (en) * 2022-03-22 2024-06-25 安天科技集团股份有限公司 Method and device for detecting potential dangerous files, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101382984A (en) * 2007-09-05 2009-03-11 江启煜 Method for scanning and detecting generalized unknown virus
IL191744A0 (en) * 2008-05-27 2009-02-11 Yuval Elovici Unknown malcode detection using classifiers with optimal training sets
CN102142068A (en) * 2011-03-29 2011-08-03 华北电力大学 Method for detecting unknown malicious code
CN102169533A (en) * 2011-05-11 2011-08-31 华南理工大学 Commercial webpage malicious tampering detection method

Also Published As

Publication number Publication date
CN102346830A (en) 2012-02-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140305

Termination date: 20190923