CN115146279A - Program vulnerability detection method, terminal device and storage medium - Google Patents

Program vulnerability detection method, terminal device and storage medium

Info

Publication number
CN115146279A
Authority
CN
China
Prior art keywords
program
code
vulnerability detection
slicing
vulnerability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210741646.XA
Other languages
Chinese (zh)
Inventor
胡玉鹏
关翔予
温杰凌
辛钰雯
齐园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202210741646.XA
Publication of CN115146279A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/427 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a program vulnerability detection method, a terminal device and a storage medium. A source program is converted into the intermediate code LLVM IR, which is fine-grained, rich in semantic information and easy to extend. The invention compensates for the insufficient semantics of existing code slicing algorithms by adding control information, and slices the intermediate code. Using an IR-based word embedding, instruction-related information is retained as far as possible and non-instruction information is discarded, according to the dependencies among instructions and the characteristics of the instruction context. This addresses the complex data processing and easy semantic loss of traditional word embedding and greatly improves the vulnerability detection rate.

Description

Program vulnerability detection method, terminal device and storage medium
Technical Field
The invention relates to a software system bug detection technology, in particular to a program bug detection method, terminal equipment and a storage medium.
Background
Many network attacks originate from software vulnerabilities. Although much effort has gone into secure programming and many kinds of vulnerability detection systems have been built, software vulnerabilities remain unlikely to be eliminated at the root.
Existing vulnerability detection schemes have two main defects: path-insensitive code slicing, and semantic loss after the code is converted into word vectors. Both lead to problems such as a high false alarm rate.
On the one hand, most existing vulnerability detection solutions rely on code slicing, which aims to extract the semantics of vulnerability patterns comprehensively, help the network identify key code, reduce the learning difficulty of the neural network model and improve the learning effect. Program slicing is a program decomposition technique that abstracts the necessary syntax and semantics from a program. Existing methods preprocess the source code into segments such as files, functions, and slices composed of interdependent statements, but most current slicing algorithms share a key problem: they are path-insensitive. With existing slice generation methods, the same code slice can be extracted from both correct code and vulnerable code, so the accuracy stays at 0.5 whether or not the detector reports a vulnerability, which makes the detection useless. This is semantic loss in data preprocessing, and it is a main disadvantage of existing vulnerability detection frameworks. The root cause is that a change in the path of a statement changes the control range of that statement, but code slices in existing frameworks cannot capture the change, for two reasons: (1) a control dependency is a rough description of the relationship between two statements (i.e., whether a dependency exists) and does not specify the path of the statement (i.e., whether it depends on a legal or an illegal value); (2) when sequences of statements are reorganized, crude stacking may place statements that are not within the same control range directly next to each other, which causes path insensitivity. Semantic information is of great importance in vulnerability detection, and more semantic information helps a neural network detect vulnerabilities that could not be found before.
Existing code slicing methods do not extract semantic information from control dependence and remain at the level of code syntax. As a result, the model is not trained correctly in the training stage, the accuracy of the detection stage may drop, and the false alarm rate of vulnerability detection rises sharply.
On the other hand, and more importantly, most current deep-learning-based vulnerability detection systems ignore the problem of semantic loss, for example Li et al. ([1] Li Z, Zou D, Xu S, et al. VulDeeLocator.). The source code vulnerability detection method of invention patent application CN113420296A (C source code vulnerability detection based on the BERT model and BiLSTM) relies on the BERT model for word-vector conversion, and BERT is known as the best current model for natural language tasks. Treating program code as a natural language task is clearly inappropriate. Most current vulnerability detectors thus either process the code directly, use a syntax tree representation, or treat the code as natural language to obtain a vector representation. However, because of the structured nature of function calls, the interchangeable order of branches and statements, and similar properties of code, none of the existing approaches based on natural language processing is sufficient to fully understand program semantics.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a program vulnerability detection method, a terminal device and a storage medium aiming at the defects of the prior art, so as to improve the efficiency and accuracy of vulnerability detection.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a program vulnerability detection method comprises the following steps:
performing static analysis on a source program to obtain an intermediate code of the source program;
extracting key points which possibly cause a vulnerability, generating a slicing standard, slicing the intermediate code by using the slicing standard, and combining a forward slice and a backward slice to obtain a code segment of the program;
using IR2vec to embed words in the code segments of the program to obtain coded vectors;
and training a neural network by using the coded vector to obtain a vulnerability detection model.
The invention exploits the fact that LLVM IR (the intermediate code) provides a clear mapping from high-level languages: the source program is converted into LLVM IR, which is fine-grained, rich in semantic information and easy to extend. The invention compensates for the insufficient semantics of code slicing by adding control information, and slices the intermediate code. Using an IR-based word embedding, instruction-related information is retained as far as possible and non-instruction information is discarded, according to the dependencies among instructions and the characteristics of the instruction context. This addresses the complex data processing and easy semantic loss of traditional word embedding and greatly improves the vulnerability detection rate.
In the invention, the static analysis of the source program to obtain its intermediate code is implemented as follows: the source code is analyzed, and a Clang command is used to obtain the intermediate code representation corresponding to the source code, giving the intermediate code of the source program.
The specific implementation process for extracting the key points which may cause the vulnerability includes:
initializing a lexical unit set Y to empty, and dividing the source program P into a plurality of functions, the function set being F;
for each function $f_i \in F$, establishing an abstract syntax tree $A_i$;
traversing each lexical unit $t_j \in A_i$ and judging whether $t_j$ matches any of the four features in Z; if it matches, setting $Y \leftarrow Y \cup \{t_j\}$ and storing Y; $Z = \{z_{api}, z_{array}, z_{pointer}, z_{arithmetic}\}$, where $z_{api}$, $z_{array}$, $z_{pointer}$ and $z_{arithmetic}$ are the special marks of library functions, arrays, pointers and expressions respectively;
outputting Y, the key points which may cause a vulnerability.
In the invention, the code segment acquisition process of the program comprises the following steps:
given a special lexical unit [Figure BDA0003718243400000031] in a statement $s_w$;
collecting the set of successor vertices of the statement $s_w$ and defining it as $S_s$; [Figure BDA0003718243400000032]; Y is the set of key points which may cause a vulnerability;
for any statement $s_s \in S_s$, judging whether $s_s$ has a data or control dependency with $s_w$ through [Figure BDA0003718243400000033]; if the dependency exists, extracting the statement $s_s$ into the forward slice;
collecting the set of predecessor vertices of the statement $s_w$ and defining it as $S_p$; for any statement $s_p \in S_p$, judging whether $s_p$ has a data or control dependency with $s_w$ through [Figure BDA0003718243400000034]; if the dependency exists, extracting the statement $s_p$ into the backward slice;
combining the forward slice and the backward slice to obtain the combined slice [Figure BDA0003718243400000035];
taking the slice [Figure BDA0003718243400000036] as a code fragment of the program.
In the invention, the code segment acquisition process of the program comprises the following steps:
if a statement in the slice [Figure BDA0003718243400000037] lies within the closed interval of $m_q$, then $m_q$ is inserted into the slice [Figure BDA0003718243400000038]; wherein [Figure BDA0003718243400000039] is initially empty;
the updating process of [Figure BDA00037182434000000310] includes: determining whether $t_j$ matches $z_{elseif}$, $z_{else}$ or $z_{case}$; if so, binding $m_{cur}$ and $m_{pre}$, where $m_{pre}$ is initially empty; if not, storing [Figure BDA00037182434000000311] in [Figure BDA00037182434000000312] and assigning $m_{cur}$ to $m_{pre}$; $t_j \in A_i$, where $t_j$ is the j-th lexical unit and $A_i$ is an abstract syntax tree; when $t_j$ matches one of the eight control statements in Z, the subtree of $A_i$ rooted at $t_j$ is denoted $a_{ij}$; $m_{cur}$ stores the minimum and maximum line numbers of $a_{ij}$; $Z = \{z_{if}, z_{elseif}, z_{else}, z_{for}, z_{while}, z_{dowhile}, z_{switch}, z_{case}\}$, where $z_{if}$, $z_{elseif}$, $z_{else}$, $z_{for}$, $z_{while}$, $z_{dowhile}$, $z_{switch}$ and $z_{case}$ are control statements;
[Figure BDA0003718243400000041] is the slice obtained by combining the forward slice and the backward slice; the control ranges [Figure BDA0003718243400000042], $m_b \in M_{st}$ are traversed: initially let $m_a[0] = m_b[0]$, take the maximum of $m_a[1]$ and $m_b[1]$ and assign it to $m_a[1]$, and update [Figure BDA0003718243400000043]; $m_a[0]$ is the lower limit of the control range $m_a$, and $m_a[1]$ is its upper limit;
[Figure BDA0003718243400000044] is set to be initially empty; for two functions $f_\upsilon$, $f_\omega$ in [Figure BDA0003718243400000045], if $f_\upsilon$ calls $f_\omega$, then [Figure BDA0003718243400000046] is assigned to [Figure BDA0003718243400000047], which is the final code slicing result of the source program; [Figure BDA0003718243400000048];
the updating process of [Figure BDA0003718243400000049] comprises: a set [Figure BDA00037182434000000410] is set to be initially empty; for two statements $s_\lambda$, $s_\mu$ occurring in [Figure BDA00037182434000000411], if the successor node of $s_\mu$ is $s_\lambda$, or $s_\mu$ is less than $s_\lambda$, then [Figure BDA00037182434000000412] is reassigned to [Figure BDA00037182434000000413];
according to the correspondence between the source program and the intermediate code, the program slice [Figure BDA00037182434000000414] of the source program is converted into program slice fragments of the intermediate code.
By identifying the control range of each control statement, the invention matches all ranges that can propagate to a statement and records them in the slice, thereby preserving the positive or negative dependency relationships between statements in the slice.
In the invention, the neural network is a bidirectional recurrent neural network model, and the encoded vectors are input into it in random order. The invention shuffles the vectors after word embedding so that they enter the bidirectional recurrent neural network model in a random order, which prevents the order of the input data from influencing network training. The added randomness improves the generalization of the network, avoids extreme gradients during weight updates caused by regularly ordered data, and helps avoid over-fitting or under-fitting of the final model.
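The shuffling step can be illustrated with a minimal Python sketch. It assumes the encoded vectors and their labels are held in two parallel lists; the function name `shuffled_pairs` and the fixed seed are hypothetical choices for illustration and do not come from the patent.

```python
import random
from typing import List, Sequence, Tuple


def shuffled_pairs(vectors: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   seed: int = 0) -> List[Tuple[Sequence[float], int]]:
    """Return (vector, label) pairs in a random order.

    Vectors and labels are zipped before shuffling so that each
    vector keeps its own label after the order is randomized.
    """
    pairs = list(zip(vectors, labels))
    random.Random(seed).shuffle(pairs)  # seeded only for reproducibility
    return pairs


vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
labels = [0, 1, 0, 1]
batch = shuffled_pairs(vectors, labels)
```

Pairing before shuffling is the important design point: shuffling the two lists independently would silently mislabel the training data.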
The method of the invention further comprises: inputting the source code to be predicted into the vulnerability detection model and extracting the outputs of the model that exceed a set threshold; these outputs are the line numbers of possible vulnerabilities.
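As a rough illustration of this thresholding step, the sketch below assumes the model output has already been arranged as a mapping from source line number to a score in [0, 1]. That layout, the name `flag_lines` and the default threshold of 0.5 are assumptions made for the example; the patent only says that a threshold is set.

```python
from typing import Dict, List


def flag_lines(scores: Dict[int, float], threshold: float = 0.5) -> List[int]:
    """Return the line numbers whose score is strictly above the threshold."""
    return sorted(line for line, score in scores.items() if score > threshold)


# Hypothetical per-line scores produced by a trained detector.
scores = {10: 0.12, 14: 0.91, 15: 0.50, 22: 0.77}
suspicious = flag_lines(scores)  # lines 14 and 22 exceed 0.5
```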
As an inventive concept, the present invention also provides a terminal device, which includes a processor and a memory; the memory stores computer programs/instructions; the processor executes the computer programs/instructions stored by the memory; the computer program/instructions are configured to implement the steps of the method of the present invention.
As an inventive concept, the present invention also provides a computer storage medium having stored thereon a computer program/instructions; which when executed by a processor, perform the steps of the method of the present invention.
Compared with the prior art, the invention has the beneficial effects that:
1. the method accurately captures the semantic information of the program, achieves a lower false positive rate while maintaining high accuracy, and refines the granularity so that vulnerabilities are located at the line level;
2. the invention slices the IR code; because the IR is fine-grained and rich in code semantic information, vulnerability analysis and detection on IR code is more accurate;
3. the method adds control information to compensate for the insufficient semantics of code slices: it identifies the control range of each control statement, matches all ranges that can propagate to a statement and records them in the slice, thereby preserving the positive or negative dependency relationships between statements in path-sensitive slices;
4. unlike traditional word embedding, the invention regards semantic relations, and not only the surrounding context, as important for program representation, so that the original semantics are preserved as far as possible during the embedding mapping and non-instruction information is discarded. The invention adopts knowledge-graph-based embedding: a knowledge graph embedding model groups similar data points together by means of relations, so that the encoding of an IR element can adapt to the context of its statement, giving a representation superior to context-independent static methods such as Word2Vec.
Drawings
Fig. 1 is a diagram of a neural network structure according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a vulnerability detection method based on IR code slicing, which improves the efficiency and accuracy of vulnerability detection. More critically, after the IR code has been sliced, word embedding is performed by combining a representation learning method with flow information to capture the syntax and semantics of the input program.
The embodiment of the invention comprises the following steps:
S1, performing static analysis on the source program with the Clang tool to obtain the intermediate code representation of the program;
specifically: the source code is analyzed, and a Clang command is used to obtain the corresponding intermediate code representation, namely the IR code (a .ll file).
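The patent does not give the exact Clang command. A minimal sketch of one plausible invocation is shown below: `-S -emit-llvm` asks Clang for textual LLVM IR, while `-g` and `-O0` are assumptions added here so that IR instructions keep debug line information and stay close to the source. The helper name `clang_emit_ir_command` is hypothetical.

```python
import shlex
from pathlib import Path
from typing import List


def clang_emit_ir_command(source: str, clang: str = "clang") -> List[str]:
    """Build the argument list that asks Clang for textual LLVM IR (.ll)."""
    output = str(Path(source).with_suffix(".ll"))
    # -S -emit-llvm : emit human-readable IR instead of an object file
    # -g            : assumption, keeps debug info mapping IR to source lines
    # -O0           : assumption, disables optimizations that reorder code
    return [clang, "-S", "-emit-llvm", "-g", "-O0", source, "-o", output]


cmd = clang_emit_ir_command("example.c")
print(shlex.join(cmd))
```

On a machine with Clang installed, the command could be executed with `subprocess.run(cmd, check=True)`; the sketch only builds it so that it runs anywhere.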
S2, extracting key points which may cause a vulnerability and generating a slicing standard; then slicing the intermediate code and combining the forward slice and the backward slice to obtain a code segment of the program.
In S2, the key points which may cause a vulnerability are extracted by searching the abstract syntax tree generated from the source program for four kinds of syntax tree nodes that may cause code vulnerabilities. The specific implementation is as follows:
four special lexical units are identified and located before one of them is selected to generate its corresponding backward and forward slices. The four special lexical units are: library/API function calls (FC), array usage (AU), pointer usage (PU) and arithmetic expressions (AE).
Step S2 is implemented with the open-source C/C++ code analysis tool Joern, as follows:
Input: a program $P = \{s_1, \ldots\}$ consisting of several statements; the grammatical feature set $Z = \{z_{api}, z_{array}, z_{pointer}, z_{arithmetic}\}$ of the four special lexical units, where $z_{api}$, $z_{array}$, $z_{pointer}$ and $z_{arithmetic}$ are the special marks of library functions, arrays, pointers and expressions respectively.
Output: the special lexical unit set Y;
1: let the program P contain the statements $s_1, s_2, \ldots$; let $Z = \{z_{api}, z_{array}, z_{pointer}, z_{arithmetic}\}$ be the grammatical feature set of the four special lexical units;
2: initialize the special lexical unit set Y (its initial state is empty);
3: divide P into a group of functions and let the function set be F;
4: traverse each function $f_i \in F$ and build an abstract syntax tree $A_i$ for each $f_i$;
5: traverse each lexical unit $t_j \in A_i$ and judge whether $t_j$ matches any of the four features in Z; if it matches, set $Y \leftarrow Y \cup \{t_j\}$ and store Y;
6: finally return Y.
And the returned Y is the key point which can cause the vulnerability and is used as the slicing standard.
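The key-point extraction loop above can be sketched in a few lines of Python. The sketch does not use Joern: it assumes each function has already been parsed into a flat list of AST tokens, and the `Token` type, the node-kind strings (`"call"`, `"array_index"`, `"pointer_deref"`, `"arithmetic"`) and the small `LIBRARY_FUNCS` set are hypothetical stand-ins for the four features in Z.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    line: int   # source line of the lexical unit
    kind: str   # hypothetical AST node kind
    text: str   # source text of the lexical unit


# Illustrative subset of risky library functions (assumption).
LIBRARY_FUNCS = {"strcpy", "memcpy", "gets"}


def match_feature(tok: Token) -> Optional[str]:
    """Return which of the four features in Z the token matches, if any."""
    if tok.kind == "call" and tok.text in LIBRARY_FUNCS:
        return "api"
    if tok.kind == "array_index":
        return "array"
    if tok.kind == "pointer_deref":
        return "pointer"
    if tok.kind == "arithmetic":
        return "arithmetic"
    return None


def extract_key_points(functions: Dict[str, List[Token]]) -> List[Tuple[str, Token, str]]:
    """Steps 2 to 6: walk every token of every function and collect matches in Y."""
    Y: List[Tuple[str, Token, str]] = []
    for name, ast_tokens in functions.items():
        for tok in ast_tokens:
            feature = match_feature(tok)
            if feature is not None:
                Y.append((name, tok, feature))
    return Y


functions = {
    "copy": [Token(3, "call", "strcpy"), Token(4, "call", "log_msg"),
             Token(5, "array_index", "buf[i]")],
    "calc": [Token(9, "arithmetic", "a + b"), Token(10, "identifier", "x")],
}
Y = extract_key_points(functions)
```

Each element of `Y` records the function, the matching token and the matched feature, which is enough to serve as a slicing standard for the next step.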
The specific implementation process of the code slice in the step S2 comprises the following steps:
Forward and backward slices are extracted together for each special lexical unit, because a vulnerability is determined by several statements in its context and a unidirectional slice may lead to semantic ambiguity or semantic loss. The invention first converts the source program into a PDG (program dependence graph) and, through reachability analysis of the graph, generates forward and backward slices based on data and control dependence. This code slicing has two advantages: first, data dependence finds the statements that are vulnerable to attack; second, control dependence enriches the syntactic information, which in most cases mitigates the loss of accuracy caused by missing semantics. The specific process is as follows:
Input: a statement $s_w$ in the program P; a lexical unit [Figure BDA0003718243400000061] in the special lexical unit set Y generated by the statement $s_w$; a program dependence graph G corresponding to the program P;
Output: the forward and backward slices [Figure BDA0003718243400000063] corresponding to the lexical unit [Figure BDA0003718243400000062];
1: given a special lexical unit [Figure BDA0003718243400000064] in a statement $s_w$;
2: from the connected component $G_w$ containing the statement $s_w$, collect its set of successor vertices and define it as $S_s$;
3: for any $s_s \in S_s$, recursively traverse with the function [Figure BDA0003718243400000065] to judge whether $s_s$ has a data or control dependency with $s_w$ through [Figure BDA0003718243400000071]; if the dependency exists, extract the statement into the forward slice; if not, continue the loop; forwardSlice() is the function that extracts forward slices;
4: backward slices are extracted in the same way: collect the set of predecessor vertices of the statement $s_w$ and define it as $S_p$;
5: for any $s_p \in S_p$, recursively traverse with the function [Figure BDA0003718243400000072] to judge whether $s_p$ has a data or control dependency with $s_w$ through [Figure BDA0003718243400000073]; if the dependency exists, extract the statement into the backward slice; if not, continue the loop; backwardSlice() is the function that extracts backward slices;
6: finally, merge the forward slice and the backward slice to obtain the merged slice [Figure BDA0003718243400000074].
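The slicing procedure above amounts to graph reachability over the PDG in both directions. The following Python sketch makes that concrete under simplifying assumptions: statements are identified by line number, every edge `(src, dst)` means that `dst` has a data or control dependency on `src`, and the names `reachable` and `slice_for` are hypothetical (they play the role of forwardSlice() and backwardSlice() but are not the patent's implementation).

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def reachable(graph: Dict[int, Set[int]], start: int) -> Set[int]:
    """Breadth-first search: every node reachable from start, excluding start
    unless a cycle leads back to it."""
    seen: Set[int] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def slice_for(pdg_edges: Iterable[Tuple[int, int]], s_w: int) -> List[int]:
    """Merge the forward and backward slices of statement s_w."""
    succ: Dict[int, Set[int]] = {}
    pred: Dict[int, Set[int]] = {}
    for src, dst in pdg_edges:
        succ.setdefault(src, set()).add(dst)
        pred.setdefault(dst, set()).add(src)
    forward = reachable(succ, s_w)    # statements that depend on s_w
    backward = reachable(pred, s_w)   # statements that s_w depends on
    return sorted(forward | backward | {s_w})


# Toy PDG: line 3 depends on lines 1 and 2, line 5 on 3, line 7 on 5; 4 -> 6 is unrelated.
edges = [(1, 3), (2, 3), (3, 5), (4, 6), (5, 7)]
merged = slice_for(edges, 3)
```

For the toy graph, the merged slice of line 3 contains lines 1, 2, 3, 5 and 7, while the unrelated lines 4 and 6 are dropped.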
The key point of the embodiment of the invention is that control information is added to compensate for the insufficient semantics of code slices. A control dependency between two statements is only a rough description, and reorganizing slices without taking the control range into account leaves no semantic separation between two control ranges. The invention therefore adds control information to supply the semantic information missing from the PDG:
(1) A corresponding abstract syntax tree is generated from the source code and nodes satisfying 8 syntax features are defined as key nodes, since here a clear control scope is involved.
(2) And calculating the maximum value and the minimum value of the line number in the subtree taking the key node as the root.
(3) In special cases (e.g., if, else) several adjacent control ranges are bound.
(4) The stack is used to correct the correspondence between the start node and the end node of the control range.
(5) The control range is inserted into the corresponding key node of the PDG, to form a complete dependency.
(6) And adjusting the statement relationship inside the functions according to the line numbers, and adjusting the statement relationship among the functions by calling the relationship.
The specific algorithm for adding control information to supplement the semantic information in the PDG is as follows:
Input: a program $P = \{s_1, \ldots\}$ consisting of several statements; the grammatical feature set $Z = \{z_{if}, z_{elseif}, z_{else}, z_{for}, z_{while}, z_{dowhile}, z_{switch}, z_{case}\}$ of the eight control statements; a lexical unit [Figure BDA0003718243400000075] generated by a statement $s_w$; the slice [Figure BDA0003718243400000077] corresponding to the lexical unit [Figure BDA0003718243400000076];
Output: the slicing result [Figure BDA0003718243400000079] corresponding to the lexical unit [Figure BDA0003718243400000078];
1: let the program P contain the statements $s_1, s_2, \ldots$; let $Z = \{z_{if}, z_{elseif}, z_{else}, z_{for}, z_{while}, z_{dowhile}, z_{switch}, z_{case}\}$ be the grammatical feature set of the 8 control statements;
2: divide P into a group of functions and let the function set be F;
3: traverse each function $f_i \in F$ and build an abstract syntax tree $A_i$ for each $f_i$;
4: traverse each lexical unit $t_j \in A_i$ and judge whether $t_j$ matches one of the eight features in Z; if it matches, the subtree of $A_i$ rooted at $t_j$ is denoted $a_{ij}$ and defined as a key node;
5: calculate the minimum and maximum line numbers of $a_{ij}$ and store the result in $m_{cur}$;
6: judge whether $t_j$ matches $z_{elseif}$, $z_{else}$ or $z_{case}$; if it matches, bind $m_{cur}$ and $m_{pre}$ together; if not, store [Figure BDA0003718243400000081] in [Figure BDA0003718243400000082] ([Figure BDA0003718243400000083] is a set whose initial state is empty) and assign $m_{cur}$ to $m_{pre}$;
7: push the paired symbols (such as parentheses and braces) onto a stack and match them to obtain the ranges of their open and closed intervals, and store the ranges in the set $M_{st}$;
8: traverse the control ranges [Figure BDA0003718243400000084], $m_b \in M_{st}$: initially let $m_a[0] = m_b[0]$, take the maximum of $m_a[1]$ and $m_b[1]$ and assign it to $m_a[1]$, and finally update [Figure BDA0003718243400000085]; $m_a[0]$ is the lower limit of the control range $m_a$, and $m_a[1]$ is its upper limit; repeat step 8 for [Figure BDA0003718243400000086] until all control ranges have been traversed, giving the updated control ranges [Figure BDA0003718243400000087];
9: for [Figure BDA0003718243400000088] (here [Figure BDA0003718243400000089] is the [Figure BDA00037182434000000810] obtained after the update in step 8), traverse the slice: if a statement in [Figure BDA00037182434000000811] lies in the closed interval of $m_q$, insert $m_q$ into the slice [Figure BDA00037182434000000812]; repeat step 9 for the updated [Figure BDA00037182434000000813] until all control ranges have been traversed;
10: let the set [Figure BDA00037182434000000814] be initially empty; for two statements $s_\lambda$, $s_\mu$ occurring in [Figure BDA00037182434000000815], if the successor node of $s_\mu$ is $s_\lambda$, or $s_\mu$ is less than $s_\lambda$, then [Figure BDA00037182434000000816] is reassigned to [Figure BDA00037182434000000817]; repeat step 10 for the remaining statements in [Figure BDA0003718243400000091] until all statements in [Figure BDA0003718243400000092] have been traversed;
11: let the set [Figure BDA0003718243400000093] be initially empty; for two functions $f_\upsilon$, $f_\omega$ in [Figure BDA0003718243400000094], if $f_\upsilon$ calls $f_\omega$, then according to $f_\upsilon$ and $f_\omega$ obtain the corresponding [Figure BDA0003718243400000096] in the set [Figure BDA0003718243400000095], and reassign [Figure BDA0003718243400000097] to [Figure BDA0003718243400000098].
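Steps 8 and 9 can be illustrated with a small Python sketch. Because the formulas of the original are lost, the sketch follows one possible reading and should be taken as an assumption: control ranges are `(lower, upper)` line-number pairs, ranges sharing a lower bound are merged by keeping the larger upper bound (step 8), and when any statement of the slice lies inside a range, the two boundary lines of that range are added to the slice (step 9). The names `merge_ranges` and `add_control_ranges` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Range = Tuple[int, int]  # (lower line, upper line) of a control range


def merge_ranges(ranges: Iterable[Range]) -> List[Range]:
    """Step 8 (one reading): for equal lower bounds keep the largest upper bound."""
    merged: Dict[int, int] = {}
    for lower, upper in ranges:
        merged[lower] = max(upper, merged.get(lower, upper))
    return sorted(merged.items())


def add_control_ranges(slice_lines: Iterable[int], ranges: Iterable[Range]) -> List[int]:
    """Step 9 (one reading): record the bounds of every range that covers a slice line."""
    lines = list(slice_lines)
    result = set(lines)
    for lower, upper in merge_ranges(ranges):
        if any(lower <= line <= upper for line in lines):
            result.update((lower, upper))  # closed interval: both bounds are kept
    return sorted(result)


control_ranges = [(4, 8), (4, 10), (20, 25)]
enriched = add_control_ranges([5, 12], control_ranges)
```

In the example, line 5 falls inside the merged range (4, 10), so lines 4 and 10 join the slice; line 12 lies in no range and the range (20, 25) covers no slice line, so nothing else changes.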
Unlike other code slicing methods, the method of the embodiment of the present invention adds control information to the program dependency graph to make up for the lacking semantic information. The method solves the problems that the control dependency relationship in the prior art is too rough, the range of the dependency relationship can not be captured, and the details of the dependency relationship can not be captured.
Finally, according to the correspondence between the source program and the intermediate code, the program slice [Figure BDA0003718243400000099] of the source program is converted into program slices of the intermediate code. After the IR code is sliced, the logic of the IR basic blocks does not change; slicing simply deletes code statements that are irrelevant to the slicing criterion.
The building blocks of LLVM IR are instructions, basic blocks, functions and modules. Each instruction contains an opcode, a type and operands, and every instruction is statically typed. A basic block is a maximal sequence of LLVM instructions without any jump. A set of basic blocks constitutes a function, and a module is a set of functions. This hierarchy of the LLVM IR representation helps to obtain embeddings at the corresponding level of the program.
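The hierarchy described above can be modelled with a few plain data classes. This is only an illustrative data structure, not the LLVM API: the class names mirror the four building blocks, and the field names and the `instruction_count` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instruction:
    opcode: str                       # e.g. "add", "ret"
    type: str                         # static type of the result, e.g. "i32"
    operands: List[str] = field(default_factory=list)


@dataclass
class BasicBlock:
    label: str
    instructions: List[Instruction] = field(default_factory=list)


@dataclass
class Function:
    name: str
    blocks: List[BasicBlock] = field(default_factory=list)


@dataclass
class Module:
    functions: List[Function] = field(default_factory=list)

    def instruction_count(self) -> int:
        """Total number of instructions across all functions and blocks."""
        return sum(len(b.instructions) for f in self.functions for b in f.blocks)


# A module holding one function that adds two i32 values and returns the sum.
entry = BasicBlock("entry", [
    Instruction("add", "i32", ["%a", "%b"]),
    Instruction("ret", "i32", ["%sum"]),
])
module = Module([Function("add2", [entry])])
```

An embedding built bottom-up would aggregate instruction vectors into block vectors, block vectors into function vectors, and function vectors into a module vector, following the same nesting.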
And S3, performing word embedding on the code segment subjected to IR program slicing by using IR2vec to obtain a coded vector (namely the vector corresponding to the vulnerability candidate in the figure 1). This distributed embedding is achieved by combining a representation learning method with the stream information to capture the syntax and semantics of the input program.
Based on these characteristics, and in order to damage the semantic information of the sliced IR code as little as possible, the embodiment of the invention for the first time introduces the IR-based IR2Vec word embedding technique, which models IR opcodes, operands, and types as entities connected by relations, and on that basis devises a vector representation better suited to fine-grained vulnerability localization. IR2Vec is not a conventional word embedding method aimed at natural language processing; it is tightly coupled with LLVM. By learning the relations among opcodes, operands, and types, and following the compositional structure of a program, it builds line-level, basic-block-level, function-level, and program-level representations from the bottom up, so that LLVM IR can be analyzed better and the internal logic and relations among IR instructions can be captured more fully. The invention uses the flow-aware encoding mode of IR2Vec for word embedding.
Step S3 is implemented as follows:
1: Before the IR instructions are embedded, the most critical step is to generate the initial seed embedding vocabulary, which guides all subsequent instruction embedding. First, each LLVM IR instruction is mapped to code triples <h, r, t>. One instruction may be represented by several triples that describe its internal and external relations, namely: the type of the current instruction (i.e., the relation between the opcode and the instruction), the relation between the opcode of the current instruction and the opcode of the next instruction, and the relation between the opcode of the current instruction and its operands. These triples preserve, as far as possible, the relations inside an instruction and between instructions, and serve as the input for training the seed embedding vocabulary. Feature embedding is then performed with TransE, a knowledge graph embedding model that learns representations of triples <h, r, t>. TransE embeds h, r, and t into the same high-dimensional space and tries to learn representations that satisfy h + r ≈ t. The output of the learning is a dictionary of entity embeddings, i.e., the seed embedding vocabulary.
2: Read in the IR code segments to be embedded and construct the program-related data structures in memory.
3: Generate a function call graph from the call relations among the functions of the program, and obtain the names of the functions called by each function from the call graph.
4: Next, obtain the word vector of each function. In the flow-aware encoding mode, instruction word vectors are generated under the guidance of the seed embedding vocabulary and the control-flow dependencies between instructions; the word vector of each basic block is formed by concatenating the word vectors of the instructions in that block, and the word vector of a function is formed by concatenating the word vectors of its basic blocks. The word vectors of the functions are generated in this step (the word vector of a function is not formed from the basic-block vectors in their original order, but from the basic-block vectors after topological sorting).
5: The word vectors of the functions are concatenated into the word vector of the input IR file, i.e., the encoded vector, which is used for subsequent training and prediction.
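The seed-vocabulary step (step 1 above) can be sketched as follows. This is a minimal illustration, not the IR2Vec implementation: the relation labels `type_of`, `next`, and `argN`, and all function names, are assumptions chosen for readability. The score and loss follow the usual TransE formulation, in which a valid triple should satisfy h + r ≈ t and is trained with a margin ranking loss against a corrupted triple.

```python
import math
from typing import List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]


def instruction_triples(opcode: str, type_: str, operands: Sequence[str],
                        next_opcode: Optional[str] = None) -> List[Triple]:
    """Map one instruction to <h, r, t> triples (relation names are hypothetical)."""
    triples = [(opcode, "type_of", type_)]
    if next_opcode is not None:
        triples.append((opcode, "next", next_opcode))
    for i, operand in enumerate(operands):
        triples.append((opcode, f"arg{i}", operand))
    return triples


def transe_score(h: Sequence[float], r: Sequence[float], t: Sequence[float]) -> float:
    """Distance ||h + r - t||; smaller means the triple fits h + r ≈ t better."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def margin_loss(positive_score: float, negative_score: float,
                margin: float = 1.0) -> float:
    """Margin ranking loss: zero once the corrupted triple is worse by `margin`."""
    return max(0.0, margin + positive_score - negative_score)
```

Training would adjust the entity and relation vectors to reduce this loss over all triples; the resulting entity vectors form the seed embedding vocabulary.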
By default, the IR2Vec tool produces a 300 x 1 word vector for a single IR file: all word vectors of the input IR code are collapsed into one row. The boundaries between instructions are therefore lost, which harms the accuracy with which the neural network predicts line numbers. To make IR2Vec better suited to a fine-grained vulnerability detection system and to achieve line-level localization accuracy, the embodiment of the invention modifies the IR2Vec prototype so that the word vector of each instruction is kept separate instead of merging all word vectors into a single row. This lays the foundation for fine-grained vulnerability localization and for the later labeling of vulnerability line numbers, and supports model training.
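The difference between the two output shapes can be illustrated with a toy example. This is a sketch under stated assumptions: the function names are hypothetical, and collapsing by element-wise summation is only one possible way of merging instruction vectors into a single row; the text above does not specify the exact operation.

```python
from typing import List

Vector = List[float]


def collapse(instruction_vectors: List[Vector]) -> Vector:
    """Merge all instruction vectors into one row (here: element-wise sum).

    The result has no notion of which instruction contributed what,
    so a line-level prediction is no longer possible.
    """
    return [sum(column) for column in zip(*instruction_vectors)]


def keep_separate(instruction_vectors: List[Vector]) -> List[Vector]:
    """Keep one row per instruction, so row i still refers to instruction i."""
    return [list(vector) for vector in instruction_vectors]


vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]   # three toy 2-dimensional vectors
file_level = collapse(vectors)                    # one row for the whole file
line_level = keep_separate(vectors)               # three rows, one per instruction
```

With the per-instruction layout, each row can later be paired with a label and with a time step of the recurrent network.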
S4: construct a bidirectional recurrent neural network model (as shown in Fig. 1), feed the vectors encoded by the word embedding in S3 into the network, train the vulnerability detection model, and keep adjusting the parameters according to the loss function until the model achieves the best vulnerability detection performance.
Step S4 is implemented as follows:
1) The code slices are labeled first. Since the model of the embodiment of the present invention is trained by supervised learning, labeled source code must be obtained from the source data sets SARD and NVD, and the corresponding code slices must be labeled. Specifically, after the original vulnerability data set is sliced, the positions in the IR slice that correspond to the vulnerable line numbers of the original data set are obtained and used as training labels.
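The labeling step can be sketched as a lookup from IR positions to source lines. This is a hypothetical illustration: `label_ir_slice` and the list-based mapping are assumptions, and in practice the mapping would have to be derived from the IR itself.

```python
from typing import List, Optional, Sequence


def label_ir_slice(ir_to_source_line: Sequence[Optional[int]],
                   vulnerable_source_lines: Sequence[int]) -> List[int]:
    """Return one 0/1 label per IR instruction in the slice.

    ir_to_source_line[i] is the source line that IR instruction i came from,
    or None when no source line is known for it (assumed input format).
    """
    vulnerable = set(vulnerable_source_lines)
    return [1 if line in vulnerable else 0 for line in ir_to_source_line]


# Toy slice: five IR instructions, source line 11 is marked vulnerable.
labels = label_ir_slice([10, 10, 11, None, 12], [11])
```

Each label lines up with one per-instruction word vector, which is what makes line-level training possible.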
2) As shown in Fig. 1, the embodiment of the invention mainly uses a BRNN model. The training model consists of two parts: the first is a conventional bidirectional recurrent neural network comprising several BRNN layers, a dropout layer, a dense layer, and an activation layer; the second comprises a multiplication layer, a max pooling layer, and an average pooling layer.
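For orientation, a bidirectional recurrent layer with a per-time-step output can be written in the usual textbook form below. This is a generic formulation given as an assumption, not the exact layer configuration of the embodiment; the weight symbols are illustrative, and the multiplication and pooling layers of the second part are not covered.

```latex
\begin{aligned}
\overrightarrow{h}_t &= \tanh\!\left(W_f x_t + U_f \overrightarrow{h}_{t-1} + b_f\right) \\
\overleftarrow{h}_t  &= \tanh\!\left(W_b x_t + U_b \overleftarrow{h}_{t+1} + b_b\right) \\
y_t &= \sigma\!\left(w^{\top}\left[\overrightarrow{h}_t ;\, \overleftarrow{h}_t\right] + c\right)
\end{aligned}
```

Here $x_t$ is the word vector of the $t$-th instruction, the two hidden states read the sequence in opposite directions, and $y_t \in (0, 1)$ is the score for time step $t$.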
3) To prevent the order in which data are fed from affecting network training, the embodiment of the invention shuffles the word-embedded vectors so that they enter the network in random order. The added randomness improves the generalization of the network, prevents overly regular data from producing extreme gradients during weight updates, and helps avoid overfitting or underfitting of the final model.
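A minimal sketch of such shuffling is shown below, assuming samples and labels are held in two parallel lists. The function name and the fixed seed are illustrative choices; the important point is that both lists are permuted with the same order so that each sample keeps its label.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

S = TypeVar("S")
L = TypeVar("L")


def shuffle_together(samples: Sequence[S], labels: Sequence[L],
                     seed: int = 0) -> Tuple[List[S], List[L]]:
    """Shuffle samples and labels with one shared random permutation."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)   # seeded for reproducible experiments
    return [samples[i] for i in order], [labels[i] for i in order]
```

For example, `shuffle_together(vectors, labels, seed=42)` returns both lists in the same new order.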
4) Because the data set is strongly imbalanced, the model must finally be evaluated and its parameters tuned repeatedly.
S5: after model training is complete, perform vulnerability detection with the trained model.
Step S5 is implemented as follows:
For the source code to be predicted, IR code generation, IR code slicing, word embedding, and data preprocessing are performed; the trained BRNN part of the model is then extracted as the detection model, and the resulting vector data are fed into it.
The embodiment of the invention takes the output of the detection model, in which each time step corresponds to one code line and the value at each time step represents the probability that the corresponding line is vulnerable. The threshold is set to 0.5: a line whose value is greater than 0.5 is considered vulnerable, and a line whose value is less than 0.5 is considered not vulnerable. The outputs are then sorted by value, and the k values greater than the threshold (0.5) are extracted; these give the line numbers most likely to be vulnerable in the IR code slice. Finally, the predicted IR line numbers are mapped back to source line numbers using the mapping between the IR code slice and the source line numbers established from the debugging information.
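The post-processing described above (threshold, sort, map back) can be sketched as follows. The function name and the list-based IR-to-source mapping are assumptions made for illustration; the 0.5 threshold is the value stated in the text.

```python
from typing import List, Optional, Sequence


def predict_vulnerable_lines(scores: Sequence[float],
                             ir_to_source_line: Sequence[Optional[int]],
                             threshold: float = 0.5) -> List[int]:
    """Return source line numbers whose score exceeds the threshold,
    ordered from most to least likely and without duplicates."""
    # Keep only the time steps above the threshold, highest score first.
    ranked = sorted(
        (i for i, score in enumerate(scores) if score > threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    result: List[int] = []
    for i in ranked:
        line = ir_to_source_line[i]
        # Skip IR lines without a known source line and repeated source lines.
        if line is not None and line not in result:
            result.append(line)
    return result


# Toy output of five time steps and their (hypothetical) source lines.
lines = predict_vulnerable_lines([0.1, 0.9, 0.6, 0.5, 0.95], [3, 4, 7, 5, 9])
```

Because several IR instructions usually come from one source line, the duplicate check keeps each reported source line only once.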
The embodiment of the invention adopts a semantics-enhanced IR code slicing method, which ensures that more control semantic information is retained during slicing, and performs word embedding by combining representation learning with flow information to capture the syntax and semantics of the input program. This greatly improves the vulnerability detection capability, the interpretability of the code, and the accuracy of vulnerability detection.
Using this method, experiments were carried out on a desktop computer with an Intel Core i5-10600KF processor, 16 GB of memory, and an NVIDIA 1080 graphics card. The training samples collected for the experiment cover 4 different vulnerability categories and total 189554 items, and the training data were iterated over 20 times during training. The experiment took about 29 hours; the final validation accuracy was about 95%, the precision about 90%, and F1 about 90%.
After training is completed, the learned parameters are stored as model files and reloaded into the neural network during vulnerability detection. Taking the memory crash vulnerability as an example, the experimental result is a detection accuracy of 97.6% and a precision of 92.7%.
The analysis of the results shows that converting the source code file into IR code, slicing the code while preserving semantic information, converting the slices into representation vectors with IR2Vec, and using the neural network to extract vulnerability features and learn parameters ultimately achieves high accuracy in code vulnerability detection.

Claims (10)

1. A program vulnerability detection method, characterized by comprising the following steps:
performing static analysis on a source program to obtain intermediate code of the source program;
extracting key points that may cause a vulnerability, generating a slicing criterion, slicing the intermediate code using the slicing criterion, and combining the forward slice and the backward slice to obtain a code segment of the program;
performing word embedding on the code segment of the program using IR2Vec to obtain encoded vectors;
and training a neural network with the encoded vectors to obtain a vulnerability detection model.
2. The program vulnerability detection method according to claim 1, wherein performing static analysis on the source program to obtain the intermediate code of the source program comprises: analyzing the source code and using a Clang command to obtain the intermediate code representation corresponding to the source code, thereby obtaining the intermediate code of the source program.
3. The program vulnerability detection method of claim 1, wherein extracting the key points that may cause a vulnerability comprises:
initializing a lexical unit set $Y$ to empty, and dividing the source program $P$ into a plurality of functions, the set of functions being $F$;
for each function $f_i \in F$, building an abstract syntax tree $A_i$;
traversing each lexical unit $t_j \in A_i$ and judging whether $t_j$ matches one of the four features in $Z$;
if it matches, setting $Y = Y \cup \{t_j\}$ and storing $Y$; $Z = \{z_{api}, z_{array}, z_{pointer}, z_{arithmetic}\}$, where $z_{api}$, $z_{array}$, $z_{pointer}$, and $z_{arithmetic}$ are the special marks of library functions, arrays, pointers, and expressions, respectively; and outputting $Y$, the key points that may cause a vulnerability.
4. The program vulnerability detection method according to claim 1, wherein obtaining the code segment of the program comprises:
given a specific lexical unit
Figure FDA0003718243390000011
in a statement $s_w$, collecting and defining the set $S_s$ for statement $s_w$;
Figure FDA0003718243390000012
$Y$ being the key points that may cause a vulnerability;
for any statement $s_s \in S_s$, judging whether $s_s$ has a data or control dependence on $s_w$ through
Figure FDA0003718243390000013
and, if so, extracting statement $s_s$ as a forward slice;
collecting the set of predecessor vertices of statement $s_w$ and defining it as $S_p$; for any statement $s_p \in S_p$, judging whether $s_p$ has a data or control dependence on $s_w$ through
Figure FDA0003718243390000021
and, if so, extracting statement $s_p$ as a backward slice;
combining the forward slices and the backward slices to obtain the combined slice
Figure FDA0003718243390000022
and taking the slice
Figure FDA0003718243390000023
as the code segment of the program.
5. The program vulnerability detection method according to claim 1 or 4, wherein the code segment obtaining process of the program comprises:
if it is sliced
Figure FDA0003718243390000024
One sentence in m q In the closed interval of (1), then m is q And m q Is inserted into the slice
Figure FDA0003718243390000025
wherein
Figure FDA0003718243390000026
Figure FDA0003718243390000027
is initially empty, and
Figure FDA0003718243390000028
is updated as follows: judging whether $t_j$ is $z_{elseif}$, $z_{else}$, or $z_{case}$; if so, binding $m_{cur}$ and $m_{pre}$, the initial value of $m_{pre}$ being empty; if not, recording
Figure FDA0003718243390000029
in
Figure FDA00037182433900000210
and assigning $m_{cur}$ to $m_{pre}$; $t_j \in A_i$, where $t_j$ is the j-th lexical unit and $A_i$ is an abstract syntax tree; when $t_j$ matches one of the eight control statements in $Z$, $t_j$, as an element of $A_i$, is denoted $a_{ij}$; $m_{cur}$ stores the minimum and maximum line numbers of $a_{ij}$; $Z = \{z_{if}, z_{elseif}, z_{else}, z_{for}, z_{while}, z_{dowhile}, z_{switch}, z_{case}\}$, where $z_{if}$, $z_{elseif}$, $z_{else}$, $z_{for}$, $z_{while}$, $z_{dowhile}$, $z_{switch}$, and $z_{case}$ are control statements;
Figure FDA00037182433900000211
is the slice obtained by combining the forward slices and the backward slices; for any control ranges
Figure FDA00037182433900000212
$m_b \in M_{st}$, a traversal is performed: initially let $m_a[0] = m_b[0]$, take the maximum of $m_a[1]$ and $m_b[1]$, assign it to $m_a[1]$, and update
Figure FDA00037182433900000213
where $m_a[0]$ is the lower limit of the control range $m_a$ and $m_a[1]$ is its upper limit;
let
Figure FDA00037182433900000214
be initially empty; for two functions $f_\upsilon$ and $f_\omega$ in
Figure FDA00037182433900000215
if $f_\upsilon$ calls $f_\omega$, then
Figure FDA00037182433900000216
is assigned to
Figure FDA00037182433900000217
Figure FDA00037182433900000218
is the final code slicing result of the source program;
Figure FDA00037182433900000219
Figure FDA00037182433900000220
is updated as follows: let the set
Figure FDA00037182433900000221
be initially empty; for any two statements appearing in
Figure FDA00037182433900000222
denoted $s_\lambda$ and $s_\mu$, if $s_\mu$ is $s_\lambda$ or $s_\mu$ is less than $s_\lambda$, then
Figure FDA0003718243390000031
is reassigned to
Figure FDA0003718243390000032
according to the correspondence between the source program and the intermediate code, the program slice of the source program
Figure FDA0003718243390000033
is converted into program slices of the intermediate code.
6. The program vulnerability detection method of claim 1, wherein performing word embedding on the code segment of the program using IR2Vec to obtain the encoded vectors comprises:
reading the code segment to be embedded;
generating a function call graph from the call relations among the functions of the program, and obtaining the names of the called functions from the function call graph;
generating instruction word vectors under the guidance of a seed embedding vocabulary and the control-flow dependencies between instructions, the word vector of each basic block being formed by concatenating the word vectors of its instructions and the word vector of a function being formed by concatenating the word vectors of its basic blocks, thereby generating the word vectors of the functions;
concatenating the word vectors of the functions to obtain the encoded vectors;
wherein the seed embedding vocabulary is generated by: mapping program instructions to code triples <h, r, t>, where h, r, and t respectively represent the type of the current instruction, the relation between the opcode of the current instruction and the opcode of the next instruction, and the relation between the opcode of the current instruction and its operands; and embedding h, r, and t into the same high-dimensional space with a knowledge graph model to obtain the seed embedding vocabulary.
7. The program vulnerability detection method of claim 1, wherein the neural network is a bidirectional recurrent neural network model, and the encoded vectors are input into the bidirectional recurrent neural network model in random order.
8. The program vulnerability detection method of claim 1, further comprising: inputting the source code to be predicted into the vulnerability detection model, and extracting the outputs of the vulnerability detection model that are greater than a set threshold, the extracted outputs being the line numbers of possible vulnerabilities.
9. A terminal device comprising a processor and a memory; the memory stores computer programs/instructions; the processor executes the computer programs/instructions stored in the memory; the computer programs/instructions are configured to implement the steps of the method of any one of claims 1 to 8.
10. A computer storage medium having computer programs/instructions stored thereon, characterized in that the computer programs/instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 8.
CN202210741646.XA 2022-06-28 2022-06-28 Program vulnerability detection method, terminal device and storage medium Pending CN115146279A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210741646.XA CN115146279A (en) 2022-06-28 2022-06-28 Program vulnerability detection method, terminal device and storage medium

Publications (1)

Publication Number Publication Date
CN115146279A true CN115146279A (en) 2022-10-04

Family

ID=83410741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210741646.XA Pending CN115146279A (en) 2022-06-28 2022-06-28 Program vulnerability detection method, terminal device and storage medium

Country Status (1)

Country Link
CN (1) CN115146279A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115495755A (en) * 2022-11-15 2022-12-20 四川大学 Codebert and R-GCN-based source code vulnerability multi-classification detection method
CN115576586A (en) * 2022-11-15 2023-01-06 四川蜀天信息技术有限公司 Method for intelligently operating and maintaining server-side program of server
CN116302088A (en) * 2023-01-05 2023-06-23 广东工业大学 Code clone detection method, storage medium and equipment
CN116302088B (en) * 2023-01-05 2023-09-08 广东工业大学 Code clone detection method, storage medium and equipment
CN117725422A (en) * 2024-02-07 2024-03-19 北京邮电大学 Program code vulnerability detection model training method and detection method
CN117725422B (en) * 2024-02-07 2024-05-07 北京邮电大学 Program code vulnerability detection model training method and detection method

Similar Documents

Publication Publication Date Title
CN111783100B (en) Source code vulnerability detection method for code graph representation learning based on graph convolution network
CN111639344B (en) Vulnerability detection method and device based on neural network
CN108446540B (en) Program code plagiarism type detection method and system based on source code multi-label graph neural network
CN107229563B (en) Cross-architecture binary program vulnerability function association method
CN115146279A (en) Program vulnerability detection method, terminal device and storage medium
CN113190849B (en) Webshell script detection method and device, electronic equipment and storage medium
CN112487812B (en) Nested entity identification method and system based on boundary identification
CN113064586B (en) Code completion method based on abstract syntax tree augmented graph model
CN112668013B (en) Java source code-oriented vulnerability detection method for statement-level mode exploration
CN105159828B (en) The context sensitivity detection method of source code level
CN113297580B (en) Code semantic analysis-based electric power information system safety protection method and device
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN112580346A (en) Event extraction method and device, computer equipment and storage medium
CN115455382A (en) Semantic comparison method and device for binary function codes
CN115617395A (en) Intelligent contract similarity detection method fusing global and local features
CN116150757A (en) Intelligent contract unknown vulnerability detection method based on CNN-LSTM multi-classification model
CN113868650B (en) Vulnerability detection method and device based on code heterogeneous middle graph representation
CN110737469A (en) Source code similarity evaluation method based on semantic information on functional granularities
CN116702160B (en) Source code vulnerability detection method based on data dependency enhancement program slice
CN117591913A (en) Statement level software defect prediction method based on improved R-transducer
CN117574898A (en) Domain knowledge graph updating method and system based on power grid equipment
CN116663018A (en) Vulnerability detection method and device based on code executable path
CN113076089B (en) API (application program interface) completion method based on object type
CN115859307A (en) Similar vulnerability detection method based on tree attention and weighted graph matching
CN115048929A (en) Sensitive text monitoring method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination