CN112131120B - Source code defect detection method and device

Info

Publication number: CN112131120B
Application number: CN202011029454.3A
Authority: CN
Inventors: 笋大伟; 华嘉仪
Original assignee: Beijing Zhilian Anhang Technology Co ltd
Current assignee: Beijing Zhilian Anhang Technology Co ltd
Priority date: 2020-09-27
Filing date: 2020-09-27
Publication date: 2022-09-30
Anticipated expiration: 2040-09-27
Also published as: CN112131120A

Abstract

The invention discloses a source code defect detection method and device. The method comprises the following steps: acquiring feature code segments containing source code sensitive point feature information of known samples as training data; training a neural network model by a deep learning method on the acquired feature code segments to obtain a trained neural network model for judging whether a sensitive point has a security defect; and inputting a feature code segment containing source code sensitive point feature information of an unknown sample into the trained neural network model to determine whether the sensitive point corresponding to that feature code segment has a security defect. The invention can effectively improve the efficiency and accuracy of source code defect detection.

Description

Source code defect detection method and device

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a source code defect detection method and a source code defect detection device.

Background

Many tools and systems for software code defect detection exist in China and abroad. Common foreign code detection tools and vendors include Checkmarx, Parasoft, Coverity and Fortify. Checkmarx is a source code security detection product from Israel, designed specifically to identify, track and repair technical and logical security vulnerabilities in software source code; it can detect many kinds of security defects and supports multiple system platforms, programming languages and development frameworks. Its advantages are customizable rules and a high degree of integration; its drawbacks are low speed and precision and high cost. Parasoft is a world-leading provider of software quality, service virtualization and software lifecycle management solutions. Its products include C/C++test and Jtest, which offer complete testing functions at a relatively low price and good value, but support too few language environments. Coverity is a brand of Synopsys that includes static and dynamic code analysis tools; it offers both paid services and free open-source products, serving individuals and enterprises with different needs. Its drawbacks are a poor page user experience, which makes target defects hard to find in the result list, relatively few detectable languages, and some unfiltered false positives. Fortify SCA is a static white-box software source code security testing tool from Hewlett-Packard. It converts front-end languages (such as C/C++ and Java) into an intermediate NST file through a language compiler, analyzes the calling relations and context among the source code, then matches and searches against a rule base through an analysis engine to capture security vulnerabilities in the source code and produce an FPR result file. However, Fortify SCA is expensive to use and, according to feedback from some users, has a high false alarm rate.

Common domestic code detection tools include CoBOT, the 360 code security assurance system, OpenRASP, Scantist SCA, Pinpoint and DMSCA. CoBOT (Kubo) is the first commercial code security detection tool in China with independent intellectual property rights. It was created by the Peking University software engineering center together with the research and development team of a Beijing software enterprise, broke the monopoly of foreign products in software defect detection and security vulnerability analysis, and, with the help of the Peking University platform, is relatively widely adopted in the military, government and academies of sciences. The 360 code security assurance system is well known within the industry thanks to the company's brand publicity and its free antivirus software, and covers three major directions: source code defect detection, source code compliance detection and source code traceability detection. OpenRASP is a free, open-source runtime application self-protection product launched by Baidu Security. It is injected directly into the service of the protected application to provide real-time, function-level protection, can detect and defend against unknown vulnerabilities without policy updates or upgrades to the protected application's code, and suits internet applications that use open-source components heavily as well as financial applications developed by third-party integrators. Scantist SCA is a solution from Shanghai Kong'an that combines source code analysis with binary file analysis and helps users or enterprises obtain a detailed analysis of the third-party components used in their applications. Pinpoint, developed by Sourcebrella, is based on fifth-generation theorem-proving technology from academia and grew out of eight years of research at the Hong Kong University of Science and Technology. It understands software behavior by analyzing source code and binary code, and finds code defects, coding standard violations and malicious or illegal behavior; it is fast, accurate, flexibly customizable, transparently priced, and strong at detecting defects in C/C++ and Java/Android code. DMSCA is a scanning and analysis service platform for source code security vulnerabilities, quality defects and logic defects launched by Shanghai Duanma Technology. Built on static analysis technology, it helps enterprises evaluate, monitor and improve software security and product quality, manage development and outsourcing teams, and supports platform customization.

The prior art has the following problems:

1. Rules must be written manually. Existing detection methods and tools usually require a professional to analyze each pattern of each type of code defect and write a corresponding detection rule, or require the user to write custom rules to suit their own needs. Different research institutions and enterprises customize their source code detection standards as circumstances demand, and apply coding specifications specific to each programming language. Rules defined by human experts fall far short of covering all standards and vulnerability types, and each new vulnerability type requires experts to define new detection rules. Writing detection rules takes substantial manual effort, and the resulting rules are highly subjective, vary in quality, and often fail to cover every possible code defect pattern.

2. The false positive and false negative rates are high. Although existing static source code analysis tools use different analysis mechanisms, they share a common weakness: a high false positive or false negative rate, which reduces their usability in real scenarios. There are two main reasons. First, static analysis has inherent limitations and is, in the worst case, an undecidable problem. Second, most analysis models adopt flow-insensitive and context-insensitive analysis methods to simplify their design. Because the analysis is imprecise and context information is insufficiently captured, checking coding rules by formal methods introduces uncertainty. Fully precise analyses such as points-to analysis, guard value analysis and non-guard value analysis are technically difficult to implement and often have very high time and space complexity, and the rule matching mechanisms of existing detection systems are usually imperfect.

3. New defect patterns cannot be found. The existing detection tool does not have the capability of automatically finding a new defect mode, and because the detection rules are manually defined, the detectable content is also known code defects. If the detection rules themselves are not strict, it is likely that some defect patterns will be missed.

Disclosure of Invention

The invention aims to provide a source code defect detection method and a source code defect detection device, which can effectively improve the source code defect detection efficiency and accuracy.

In order to achieve the above object, the present invention provides a source code defect detection method, which comprises:

acquiring a feature code segment containing source code sensitive point feature information of a known sample as training data;

training a neural network model by a deep learning method on the acquired feature code segments to obtain a trained neural network model, wherein the trained neural network model is used for judging whether a sensitive point has a security defect;

inputting a feature code segment containing source code sensitive point feature information of an unknown sample into the trained neural network model to determine whether a security defect exists in a sensitive point corresponding to the feature code segment.

In order to achieve the above object, the present invention further provides a source code defect detecting apparatus, including:

the acquisition module is used for acquiring a feature code segment containing source code sensitive point feature information of a known sample as training data;

the training module is used for training the neural network model by a deep learning method on the acquired feature code segments to obtain a trained neural network model, which is used for judging whether a sensitive point has a security defect;

and the detection module is used for inputting a feature code segment containing source code sensitive point feature information of an unknown sample into the trained neural network model to determine whether the sensitive point corresponding to the feature code segment has a security defect.

In summary, the present invention provides a source code defect detection method, which includes: acquiring feature code segments containing source code sensitive point feature information of known samples as training data; training a neural network model by a deep learning method on the acquired feature code segments to obtain a trained neural network model, wherein the trained neural network model is used for judging whether a sensitive point has a security defect; and inputting a feature code segment containing source code sensitive point feature information of an unknown sample into the trained neural network model to determine whether the sensitive point corresponding to the feature code segment has a security defect. Because the scheme of the invention makes source code detection increasingly accurate by continuously training the neural network model, it addresses the problems of manually written rules and high false positive and false negative rates in the prior art.

Drawings

Fig. 1 is a schematic flow chart of a source code feature extraction method according to a first embodiment of the present invention.

Fig. 2 is a simplified abstract syntax tree diagram according to a second embodiment of the present invention.

Fig. 3 is a schematic diagram of a feature code segment code line arrangement process according to a second embodiment of the present invention.

Fig. 4 is a schematic flow chart of a source code defect detection method according to a third embodiment of the present invention.

Fig. 5 is a schematic structural diagram of a source code defect detection apparatus according to a fifth embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the solutions of the present invention are further described in detail below by referring to the accompanying drawings and examples.

Code review is an important part of the software development process and is essential for maintaining software security. Although various techniques and tools have been proposed to detect source code defects, reducing the false negative and false positive rates of static analysis tools remains an active problem. Artificial intelligence, a discipline at the intersection of statistics and computer science, is now widely applied in many fields and plays an increasingly important supporting role in scientific research and applications. In code vulnerability and defect prediction, artificial intelligence brings new directions and approaches to research in the field, and how to better combine artificial intelligence with defect detection is a major open question. The invention provides an artificial-intelligence-based source code defect detection method that scans and analyzes source code files, trains a neural network model, and detects data-flow-related software code defects with sensitive points as entry points. It thereby explores the application of artificial intelligence in the security field, makes up for shortcomings of traditional tools, and improves automated code defect detection capability.

Embodiment One

Because the input to the neural network model of the invention is a vector formed from feature code segments, and the feature code segments are produced by feature extraction from the source code file, this embodiment first explains how each feature code segment is obtained from the source code file.

This embodiment first preprocesses the source code file; then performs data flow analysis with the sensitive points as entry points to obtain feature code segments that retain the key code lines; and finally post-processes the source code file to obtain the feature code segment number, the source code file path and the line number of the sensitive point, forming a feature code file that contains the feature code segments.

A schematic flow diagram of a method for extracting source code features provided in an embodiment of the present invention is shown in fig. 1, and includes the following steps:

Step 11: preprocess the source code file, specifically by normalizing the source code file and expanding macro definitions.

In source code written in different programming languages, statements and character strings may be written across lines, null values may take different forms, and support for unary operators may differ. In C and C++, developers may also define macros to stand in for portions of code. To remove the influence of these cases on code analysis, the source code is normalized and macro definitions are expanded in advance: statements and strings written across lines are merged, null values are unified, unary operators are converted (e.g., a++ to a = a + 1), and macro definitions are expanded.
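The normalization described in this step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names `expand_macros`, `normalize_line` and `preprocess` are hypothetical, only object-like macros without parameters are expanded, comments inside string literals are not handled, and merging of statements written across lines is omitted.

```python
import re


def expand_macros(lines):
    """Expand simple object-like #define macros (hypothetical helper)."""
    macros = {}
    out = []
    for line in lines:
        m = re.match(r"\s*#define\s+(\w+)\s+(.+)", line)
        if m:
            macros[m.group(1)] = m.group(2).strip()
            out.append("")  # keep an empty line so line numbers stay stable
            continue
        for name, body in macros.items():
            # A function is used as the replacement so that backslashes in the
            # macro body are not interpreted as regex escapes.
            line = re.sub(rf"\b{re.escape(name)}\b", lambda _m, b=body: b, line)
        out.append(line)
    return out


def normalize_line(line):
    """Strip comments, unify null values and rewrite stand-alone ++ / --."""
    line = re.sub(r"/\*.*?\*/", "", line)   # single-line /* ... */ comments
    line = re.sub(r"//.*$", "", line)       # trailing // comments
    line = re.sub(r"\bNULL\b", "0", line)   # unify the spelling of null
    # "a++;" -> "a = a + 1;" and "a--;" -> "a = a - 1;"
    line = re.sub(r"^(\s*)(\w+)\s*\+\+\s*;", r"\1\2 = \2 + 1;", line)
    line = re.sub(r"^(\s*)(\w+)\s*--\s*;", r"\1\2 = \2 - 1;", line)
    return line.rstrip()


def preprocess(lines):
    return [normalize_line(line) for line in expand_macros(lines)]


if __name__ == "__main__":
    for text in preprocess(["#define EXTRA 1", "count++;", "p = malloc(n + EXTRA); // alloc"]):
        print(repr(text))
```

Replacing a macro definition line with an empty line, rather than deleting it, keeps later line numbers aligned with the original file, which matters because the extracted features are reported by line number.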

Step 12: perform data flow analysis for each sensitive point in the source code file to obtain a feature code segment containing source code sensitive point feature information.

This step specifically comprises:

S121: determine each sensitive point in the source code file.

Source code refers to the uncompiled text code files, such as .c, .cpp and .java files, that developers write according to a programming language specification during software development. The supported programming languages include Java, C and C++.

The sensitive points are divided into three categories, including sensitive function call, array sensitive operation or pointer sensitive operation; wherein the sensitive function comprises a library function or an API function; array-sensitive operations include access to or assignment of array elements; pointer sensitive operations include access to or assignment of a pointer.

The sensitive point is a heuristic concept. It can be regarded as the "center" of a vulnerability or defect, a position where defects easily occur. Research and analysis of defective code samples show that a large number of vulnerabilities and defects are associated with improper use of certain library functions or API functions and with improper operations on arrays and pointers. For vulnerabilities caused by incorrect use of library functions or API functions, the sensitive point is the library or API function call; for vulnerabilities caused by incorrect use of arrays, the sensitive point is the operation on the array. Some vulnerabilities and defects may have several kinds of sensitive points; for example, buffer errors may be associated with library or API function calls as well as array and pointer operations. The same kind of sensitive point may also appear in several kinds of vulnerabilities; for example, both buffer errors and resource management errors involve library or API function calls. The method takes three types of code behavior, namely sensitive function calls, pointer sensitive operations and array sensitive operations, as sensitive points, and uses them as entry points for extracting the features of the code.

Here, a sensitive function is a library function or API function that easily causes security problems when used improperly by a program developer. Library functions are predefined functions provided by a compiler or development tool that can be called in a source program. They include the library functions specified by the C/C++ language standards, those in the JDK (Java Development Kit), and Java library functions provided by enterprises and organizations. API functions are predefined functions provided by the operating system to program developers. Sensitive operations are operations on arrays or pointers that easily cause security problems, including access to and assignment of array elements and pointers.

Step S121 specifically includes:

S1211: generate an abstract syntax tree from the source code file.

An abstract syntax tree is an abstract representation of the syntax structure of the source code. It represents the syntactic structure of the programming language in the form of a tree, each node of which represents a structure in the source code.

S1212: analyze the abstract syntax tree and record the position of each sensitive point in the source code file to form a sensitive point list; also store the function name and arguments of each call to a user-defined function, the function name and parameters of each user-defined function definition, and the call relations among functions.
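As a rough illustration of building the sensitive point list, the sketch below scans source lines with regular expressions instead of walking a real abstract syntax tree, which is a deliberate simplification. The set `SENSITIVE_FUNCTIONS`, the `SensitivePoint` record and the function `find_sensitive_points` are hypothetical names chosen for this example. Distinguishing array operations from pointer operations needs the type information that an abstract syntax tree provides, so both are reported here as `element_access`.

```python
import re
from dataclasses import dataclass

# Assumed example list; a real tool would use a much larger, curated set.
SENSITIVE_FUNCTIONS = {"malloc", "strcpy", "free"}

CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")    # identifier followed by "("
INDEX_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\[")   # identifier followed by "["


@dataclass(frozen=True)
class SensitivePoint:
    line_no: int   # 1-based line number in the source file
    kind: str      # "function_call" or "element_access"
    name: str      # called function, or the indexed array/pointer


def find_sensitive_points(lines):
    """Return the sensitive points found in a list of source lines."""
    points = []
    for line_no, text in enumerate(lines, start=1):
        for name in CALL_RE.findall(text):
            if name in SENSITIVE_FUNCTIONS:
                points.append(SensitivePoint(line_no, "function_call", name))
        for name in INDEX_RE.findall(text):
            points.append(SensitivePoint(line_no, "element_access", name))
    return points


if __name__ == "__main__":
    code = ["char *buf = malloc(8);", "buf[0] = 'a';", "free(buf);"]
    for point in find_sensitive_points(code):
        print(point)
```

Recording the line number with each sensitive point is what later allows a flagged feature code segment to be traced back to its position in the source file.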

S122: for each sensitive point, track the data flow of its initial sensitive variable, obtain the other sensitive variables, and obtain the set of code lines semantically related to each sensitive variable.

For each sensitive point, the initial sensitive variable is the first sensitive variable added to that sensitive point's sensitive variable list. The sensitive variable list includes a plurality of sensitive variables.

When the sensitive point is a sensitive function call, the initial sensitive variable is a parameter of the sensitive function call;

when the sensitive point is an array sensitive operation, the initial sensitive variable is the array;

when the sensitive point is a pointer sensitive operation, the initial sensitive variable is the pointer.

This step specifically comprises:

S1221: loop over the sensitive point list; for each sensitive point, obtain its initial sensitive variable, add it to the sensitive variable list, track the data flow of the initial sensitive variable by analyzing the abstract syntax tree, and record the set of code lines semantically related to the initial sensitive variable;

S1222: iteratively track other sensitive variables with the initial sensitive variable as the starting point, wherein, while tracking the data flow of the initial sensitive variable,

when a first variable is transferred to a sensitive variable in a code line of the code line set, the first variable is added to the sensitive variable list as the first sensitive variable, the data flow of the first sensitive variable is tracked by analyzing the abstract syntax tree, and the set of code lines semantically related to the first sensitive variable is recorded;

or a second sensitive variable is determined according to the call relations among functions, the function name and arguments of each call to a user-defined function, and the function name and parameters of each user-defined function definition, and is added to the sensitive variable list; the data flow of the second sensitive variable is tracked by analyzing the abstract syntax tree, and the set of code lines semantically related to the second sensitive variable is recorded.

Determining the second sensitive variable according to the call relations among functions, the function name and arguments of each call to a user-defined function, and the function name and parameters of each user-defined function definition specifically comprises:

determining, according to the call relations among functions, that the function invoked by a user-defined function call is the same function as a user-defined function definition;

and, given that the argument of that call is an actual parameter and is a sensitive variable, and that the parameter of the matching user-defined function definition is a formal parameter, determining that the formal parameter corresponding to the actual parameter is a second sensitive variable.

A recursive approach is then used to enter the definition of the called function, continue tracking sensitive variables, add them to the sensitive variable list, and record the code line sets semantically related to those sensitive variables.
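The mapping from actual parameters to formal parameters, and the recursive entry into called functions, can be sketched as follows. This is an illustrative sketch only: the `FunctionInfo` record and the function `propagate_into_callees` are hypothetical, and the per-function information is assumed to have been collected from the abstract syntax tree beforehand. The function and variable names in the usage example are taken from the code discussed in the second embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionInfo:
    params: list                                  # formal parameter names, in order
    calls: list = field(default_factory=list)     # (callee name, [actual parameter names])


def propagate_into_callees(functions, func_name, sensitive, seen=None):
    """Return {function name: sensitive variables} reachable from func_name.

    `sensitive` holds the variables already known to be sensitive inside
    `func_name`; formal parameters of user-defined callees become sensitive
    when the matching actual parameter is sensitive.
    """
    seen = {} if seen is None else seen
    current = seen.setdefault(func_name, set())
    new = set(sensitive) - current
    if not new:
        return seen               # nothing new: also ends recursive call chains
    current |= new
    for callee, args in functions[func_name].calls:
        if callee not in functions:
            continue              # library function: there is no definition to enter
        formals = functions[callee].params
        passed = {formal for actual, formal in zip(args, formals) if actual in current}
        if passed:
            propagate_into_callees(functions, callee, passed, seen)
    return seen


if __name__ == "__main__":
    functions = {
        "stonesoup_process_buffer": FunctionInfo(["buffer_param"]),
        "stonesoup_handle_taint": FunctionInfo(
            ["champion_adamello"],
            [("stonesoup_process_buffer", ["stonesoup_buffer"])],
        ),
    }
    print(propagate_into_callees(functions, "stonesoup_handle_taint", {"stonesoup_buffer"}))
```

The `seen` dictionary doubles as the result and as a guard against infinite recursion when functions call each other cyclically.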

That is to say, for each sensitive point in the sensitive point list, a sensitive variable list is corresponded, and the sensitive variable list includes a plurality of sensitive variables, and is not limited to the first sensitive variable and the second sensitive variable.

In the above step, "a first variable is transferred to a sensitive variable in a code line of the code line set" means that if the information of some variable is passed to the initial sensitive variable or to any other sensitive variable, that variable is also a sensitive variable. Here, transfer refers to the occurrence in the code of an assignment statement (e.g., "variable = expression;", where "=" may also be a compound assignment operator such as +=, -=, /= or %=) or of a conditional assignment statement (e.g., "variable name …").
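The iterative tracking of sensitive variables through assignments can be sketched with a simple worklist. This is a simplified stand-in, assuming line-based regular expressions rather than the abstract syntax tree analysis described above. The names `parse_assignment`, `track_sensitive_variables`, `NOT_VARIABLES` and `EXAMPLE_LINES` are hypothetical, and a line counts as "semantically related" to a variable simply when it mentions that variable. The example lines and their line numbers are those given for `stonesoup_handle_taint` in the second embodiment.

```python
import re

# Target of an assignment: identifier, optional index, then =, +=, -=, *=, /= or %=
ASSIGN_RE = re.compile(r"\b([A-Za-z_]\w*)\s*(?:\[[^\]]*\])?\s*[+\-*/%]?=(?!=)")
# Identifiers on the right-hand side that are not immediately called as functions
SOURCE_RE = re.compile(r"\b[A-Za-z_]\w*\b(?!\s*\()")
NOT_VARIABLES = {"char", "int", "long", "short", "unsigned", "signed",
                 "void", "const", "float", "double", "sizeof"}


def parse_assignment(text):
    """Return (target, [source variables]) or None if text has no assignment."""
    m = ASSIGN_RE.search(text)
    if m is None:
        return None
    rhs = text[m.end():]
    sources = [n for n in SOURCE_RE.findall(rhs) if n not in NOT_VARIABLES]
    return m.group(1), sources


def track_sensitive_variables(lines, initial):
    """lines: {line number: text}. Returns (sensitive variable list, related lines)."""
    sensitive = [initial]        # sensitive variable list, in discovery order
    related = set()
    queue = [initial]
    while queue:
        var = queue.pop(0)
        var_re = re.compile(rf"\b{re.escape(var)}\b")
        for line_no, text in lines.items():
            if not var_re.search(text):
                continue
            related.add(line_no)
            parsed = parse_assignment(text)
            if parsed and parsed[0] == var:
                for src in parsed[1]:      # data flows from src into var
                    if src not in sensitive:
                        sensitive.append(src)
                        queue.append(src)
    return sensitive, sorted(related)


EXAMPLE_LINES = {
    10: "void stonesoup_handle_taint(char *champion_adamello)",
    12: "char *stonesoup_buffer = 0;",
    13: "char *royalising_resaw = 0;",
    15: "if (champion_adamello != 0) {",
    16: "royalising_resaw = ((char *)champion_adamello);",
    17: "stonesoup_buffer = malloc((strlen(royalising_resaw) + 1) * sizeof(char));",
    18: "if (stonesoup_buffer == 0) {",
    22: "strcpy(stonesoup_buffer, royalising_resaw);",
    27: "if (stonesoup_buffer != 0) {",
    28: "free(stonesoup_buffer);",
}

if __name__ == "__main__":
    print(track_sensitive_variables(EXAMPLE_LINES, "stonesoup_buffer"))
```

Starting from `stonesoup_buffer`, the worklist picks up `royalising_resaw` through the assignment in line 17 and then `champion_adamello` through the assignment in line 16.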

S123: arrange the acquired code line sets in their order in the source code file to obtain the feature code segment containing source code sensitive point feature information.

When call relations exist between functions, after the acquired code line sets are arranged in their order in the source code file, the method further includes splicing the code lines according to the execution order of the functions.

Step 13: post-process the source code file to obtain the feature code segment number, the source code file path and the line number of the sensitive point.

The feature code file comprises, for each entry, the feature code segment number, the source code file path, the line number of the sensitive point and the feature code segment itself. When the feature code file is used to detect code defects and a feature code segment is judged to be problematic, the corresponding position in the source code can be located through the line number.
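One possible in-memory representation of a feature code file entry, together with a plain-text serialization, is sketched below. The layout (a header line with number, path and line number, then the code lines, then a dashed separator line) is an assumption made for this example, and `FeatureRecord`, `format_record`, `parse_records` and `SEPARATOR` are hypothetical names.

```python
from dataclasses import dataclass

SEPARATOR = "-" * 35   # assumed separator line between entries


@dataclass
class FeatureRecord:
    number: int           # feature code segment number
    source_path: str      # path of the source code file
    sensitive_line: int   # line number of the sensitive point
    code_lines: list      # the feature code segment, one string per code line


def format_record(rec):
    """Serialize one entry as header line, code lines and separator."""
    header = f"{rec.number} {rec.source_path} {rec.sensitive_line}"
    return "\n".join([header, *rec.code_lines, SEPARATOR])


def parse_records(text):
    """Parse entries written by format_record back into FeatureRecord objects."""
    records = []
    for block in text.split(SEPARATOR):
        lines = [line for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        number, rest = lines[0].split(" ", 1)
        path, line_no = rest.rsplit(" ", 1)   # tolerates spaces inside the path
        records.append(FeatureRecord(int(number), path, int(line_no), lines[1:]))
    return records


if __name__ == "__main__":
    rec = FeatureRecord(6, "/path/to/code.c", 28,
                        ["free(buffer_param);", "free(stonesoup_buffer);"])
    print(format_record(rec))
```

Keeping the sensitive point's line number in the header is what lets a segment flagged by the model be mapped back to a location in the source file.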

This completes the source code feature extraction method of the invention. Each sensitive point yields one feature code segment containing source code sensitive point feature information. Many software security problems are caused by defects in software source code, which makes code review an important part of the software development process and essential for maintaining software security. A code defect generally involves multiple lines of code, and the source code features referred to in the invention are mainly code defect features, comprising the sensitive point and the related code lines in close semantic relation to it. A feature code segment captures the context information of a code sensitive point and removes irrelevant information. The source code features extracted by this method can be applied to source code defect detection.

Embodiment Two

Building on the first embodiment, a specific scenario is given below to illustrate the invention clearly. The C source file code.c shown in Table 1 is explained in detail as an example.

TABLE 1

The code is an example of a double-free vulnerability. The stonesoup_handle_taint function receives a character string and copies it into heap memory, namely into stonesoup_buffer, then checks whether the first character is greater than or equal to 'a' and, if so, calls the stonesoup_process_buffer function. If the first character of the string is greater than or equal to 'a', stonesoup_buffer is released once in the stonesoup_process_buffer function and released again in line 25 of the stonesoup_handle_taint function, forming a double-free vulnerability.

Step 21: preprocess the source code file.

A macro is defined in line 1 of the code, the "++" operator is used in line 14, comments appear at the end of lines 23 and 28, and a statement is written across lines 24 and 25. To remove the influence of these non-code characters and writing habits on code analysis, the following preprocessing is required:

1) the macro used in line 17 of the code is expanded according to the definition in line 1;

2) line 14 is replaced with "stonesoup_global_variable = stonesoup_global_variable + 1";

3) the comments at the end of lines 23 and 28 are removed;

4) lines 24 and 25 are merged (the line numbers of the code after line 25 are kept unchanged).

The result of preprocessing the source code file is shown in Table 2.

TABLE 2

Step 22: perform data flow analysis for each sensitive point in the source code file to obtain a feature code segment containing source code sensitive point feature information.

First, an abstract syntax tree of the code is generated; Fig. 2 is a simplified abstract syntax tree diagram. Under the root node (file), the function definitions (function definition) include the function stonesoup_process_buffer with its parameter buffer_param and the function stonesoup_handle_taint with its parameter champion_adamello. The function calls (funcnameexpr) include the call from stonesoup_handle_taint to stonesoup_process_buffer.

Next, the sensitive points are recorded. In the code example, free, strcpy and malloc are sensitive functions that are prone to vulnerabilities, and pointer access and assignment are sensitive operations that easily cause security problems. The code example therefore contains sensitive function calls and pointer sensitive operations: lines 5, 6, 16, 17, 22 and 28 all contain sensitive points, and the arguments of the sensitive functions and the pointers subjected to sensitive operations are the initial sensitive variables. At the same time, the following are stored: the function name and arguments of the call to a user-defined function (function stonesoup_process_buffer with argument stonesoup_buffer, in line 24); the function names and parameters of the user-defined functions in the code (function stonesoup_process_buffer with parameter buffer_param, and function stonesoup_handle_taint with parameter champion_adamello); and the function call relation contained in the code (stonesoup_handle_taint calls stonesoup_process_buffer).

Take the free function in line 28 as an example. With the argument of free as the initial sensitive variable, analyzing the abstract syntax tree yields the lines of the code that are semantically related to the argument "stonesoup_buffer", namely the lines marked by line segment 1 in Table 3: lines 12, 17, 18, 22, 23, 24, 27 and 28.

In line 17 of the code, information of the variable "royalising_resaw" is transferred to the sensitive variable "stonesoup_buffer", that is, data flows from "royalising_resaw" to "stonesoup_buffer". "royalising_resaw" is therefore also added to the sensitive variable list, and the lines of the code semantically related to it are obtained: lines 13, 16 and 17. Similarly, in line 16, data flows from "champion_adamello" to "royalising_resaw" and on to "stonesoup_buffer", so the variable "champion_adamello" becomes a sensitive variable, is also added to the sensitive variable list, and the lines semantically related to it are obtained: lines 10, 15 and 16. The lines marked by line segment 2 in Table 3 are those obtained by this data flow analysis.

In line 24 of the code, the user-defined function stonesoup_process_buffer is called. stonesoup_buffer is known to be a sensitive variable, and from the previously stored function name and arguments of the call and the function name and parameters of the user-defined function, the actual parameter stonesoup_buffer corresponds to the formal parameter buffer_param in the function definition. The definition of stonesoup_process_buffer is therefore entered, buffer_param is taken as a sensitive variable, and the semantically related code lines are obtained, namely the lines marked by line segment 3 in Table 3: lines 3, 5 and 6.

TABLE 3

Finally, the obtained code lines are arranged in order and spliced according to the function execution order given by the function call relation. Since the function stonesoup_handle_taint calls stonesoup_process_buffer at line 24, the code lines extracted from stonesoup_process_buffer are placed after line 24 and before line 27 of stonesoup_handle_taint. This completes the data flow analysis phase. The arrangement of the feature code segment's code lines is illustrated in Fig. 3, where the numbers represent the line numbers of the code lines.
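The arrangement and splicing just described can be reproduced with a short sketch. The function name `splice_by_execution_order` is hypothetical, and the sketch assumes a single call site, which is enough for this example; the line numbers are those obtained in the data flow analysis above.

```python
def splice_by_execution_order(caller_lines, call_line, callee_lines):
    """Order caller lines by position and insert the callee's lines after the call.

    caller_lines: line numbers collected in the calling function (duplicates allowed)
    call_line:    line number of the call to the user-defined function
    callee_lines: line numbers collected inside the called function
    """
    ordered = sorted(set(caller_lines))
    callee = sorted(set(callee_lines))
    # Raises ValueError if the call line was not collected for the caller.
    split_at = ordered.index(call_line) + 1
    return ordered[:split_at] + callee + ordered[split_at:]


if __name__ == "__main__":
    caller = [12, 17, 18, 22, 23, 24, 27, 28,   # related to stonesoup_buffer
              13, 16, 17,                       # related to royalising_resaw
              10, 15, 16]                       # related to champion_adamello
    callee = [3, 5, 6]                          # related to buffer_param
    print(splice_by_execution_order(caller, 24, callee))
    # prints [10, 12, 13, 15, 16, 17, 18, 22, 23, 24, 3, 5, 6, 27, 28]
```

The resulting order matches the code lines listed for the 6th sensitive point in Table 4: the lines of stonesoup_process_buffer appear between lines 24 and 27 of stonesoup_handle_taint.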

Step 23: post-process the source code file to obtain the feature code segment number, the source code file path and the line number of the sensitive point.

The previous step produced the sequence of line numbers in the code that carry vulnerability features. This step outputs the feature code segment number, the source code path and the line number of the sensitive point, together with the code lines corresponding to those line numbers; each sensitive point corresponds to one feature code segment. The code features extracted from the source code file shown in Table 1 are shown in Table 4. There are 6 sensitive points, numbered 1 to 6. Taking the 6th sensitive point as an example, the source code path is /path/to/code.c, the sensitive point is located at line 28, and the feature code segment of the 6th sensitive point is expanded in full.

1 /path/to/code.c 5

…

-----------------------------------

2 /path/to/code.c 6

…

-----------------------------------

3 /path/to/code.c 16

…

-----------------------------------

4 /path/to/code.c 17

…

-----------------------------------

5 /path/to/code.c 22

…

-----------------------------------

6 /path/to/code.c 28

void stonesoup_handle_taint(char*champion_adamello)

char*stonesoup_buffer=0;

char*royalising_resaw=0;

if(champion_adamello!=0){

royalising_resaw=((char*)champion_adamello);

stonesoup_buffer=malloc((strlen(royalising_resaw)+1)*sizeof(char));

if(stonesoup_buffer==0){

strcpy(stonesoup_buffer,royalising_resaw);

if(stonesoup_buffer[0]>=97){

stonesoup_printf("Index of first char: %i\n",stonesoup_process_buffer(stonesoup_buffer));

char stonesoup_process_buffer(char*buffer_param){

first_char=buffer_param[0]-97;

free(buffer_param);

if(stonesoup_buffer!=0){

free(stonesoup_buffer);

-----------------------------------

TABLE 4
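One way to hold and emit a record of Table 4 is sketched here. The `FeatureSegment` class, the field names and the 35-character separator are assumptions chosen for illustration; the description only fixes the three header fields (segment number, source path, sensitive point line number) followed by the code lines.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSegment:
    number: int            # feature code segment number
    path: str              # source code file path
    sensitive_line: int    # line number of the sensitive point
    code_lines: list = field(default_factory=list)


def format_segment(seg):
    """Render one record: header line, code lines, separator."""
    header = f"{seg.number} {seg.path} {seg.sensitive_line}"
    return "\n".join([header, *seg.code_lines, "-" * 35])


def parse_header(header):
    """Inverse of the header line; assumes the path contains no spaces."""
    number, path, line = header.split()
    return int(number), path, int(line)


seg6 = FeatureSegment(6, "/path/to/code.c", 28,
                      ["if(stonesoup_buffer!=0){", "free(stonesoup_buffer);"])
text = format_segment(seg6)
print(text)
```

Keeping the path and line number in the header is what lets a flagged segment be traced back to the exact line of the original file.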

In summary, the above embodiment has 6 sensitive points, and for each sensitive point a feature code segment containing source code sensitive point feature information is obtained, comprising the sensitive point and the related code lines that are closely semantically related to it. If a problem is found in a feature code segment during source code defect detection, the code line from which the problem originates can be accurately determined, which effectively improves detection accuracy.

EXAMPLE III

This embodiment illustrates the core idea of the invention. The method comprises three parts, namely acquisition of known samples, training of a neural network model and defect detection of unknown samples through the neural network model.

A schematic flow chart of the method for detecting a source code defect provided by the third embodiment of the present invention is shown in fig. 4, and the method includes:

Step 41, collecting feature code segments containing source code sensitive point feature information of known samples as training data.

Techniques based on artificial intelligence typically require a large amount of data as support. Research has shown that AI techniques can work well in the field of code defect detection, but they require sufficient and comprehensive training data. The present invention likewise requires a sufficient data set when training the model. The invention downloads or crawls a large number of source code files from public data sets such as SARD, OWASP and NVD to construct a training sample set. These public data sets contain over a hundred thousand test cases, cover programming languages such as C, C++ and Java, and include CWE defect classes. That is, a known sample of the present invention may be a feature code segment obtained from a source code file of the above data sets.

Step 42, training a neural network model by a deep learning method according to the acquired feature code segments to obtain the trained neural network model, wherein the trained neural network model is used for judging whether a sensitive point has a security defect.

Training the neural network model by a deep learning method according to the acquired feature code segments to obtain the trained neural network model specifically includes:

S421, symbolizing and vectorizing each feature code segment, packaging it into a vector, and inputting the vector into the neural network model as the i-th sample of the M samples of the deep learning training set to obtain a predicted value of whether the sensitive point has a security defect, where 1 ≤ i ≤ M and M is a natural number;

In step S421, the first step, symbolization, converts the feature code segment into its symbolic representation. This step is intended to heuristically capture some of the semantic information in the program for use in training the neural network.

Symbolizing the feature code segment includes: mapping each user-defined variable in the feature code segment one-to-one to a first symbol name; and mapping each user-defined function in the feature code segment one-to-one to a second symbol name. The first symbol names are a family of symbol names of the same form (e.g., "VAR1", "VAR2"). Note that user-defined variables appearing in different feature code segments may map to the same symbol name. The second symbol names are a family of symbol names of the same form (e.g., "FUN1", "FUN2") that are distinct from the first symbol names. Likewise, user-defined functions appearing in different feature code segments may map to the same symbol name.

It should be further noted that the user-defined variables in this embodiment include, but are not limited to, sensitive variables.
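The symbolization step can be approximated with a regular-expression pass, sketched here under stated assumptions. The keyword and library-function lists are small hypothetical stand-ins, and treating "an identifier followed by an opening parenthesis" as a function is a heuristic in place of the parser information a real tool would have. Names are numbered in order of first appearance, and the tables are local to one segment, so different segments reuse VAR1, FUN1 and so on.

```python
import re

# Hypothetical, deliberately small lists; a real tool would take these from the parser.
C_KEYWORDS = {"void", "char", "int", "if", "else", "for", "while", "return", "sizeof"}
LIBRARY_FUNCS = {"malloc", "free", "strlen", "strcpy"}

# String literals and numbers are matched first so that they are left untouched.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|\d\w*|[A-Za-z_]\w*')
CALL = re.compile(r"\s*\(")


def symbolize(lines):
    """Rename user-defined variables to VARn and user-defined functions to FUNn."""
    var_names, fun_names = {}, {}
    text = "\n".join(lines)

    def rename(match):
        word = match.group(0)
        if word[0] == '"' or word[0].isdigit() or word in C_KEYWORDS or word in LIBRARY_FUNCS:
            return word
        # Heuristic: an identifier directly followed by "(" is a function name.
        is_function = CALL.match(text, match.end()) is not None
        table, prefix = (fun_names, "FUN") if is_function else (var_names, "VAR")
        if word not in table:
            table[word] = f"{prefix}{len(table) + 1}"
        return table[word]

    return TOKEN.sub(rename, text).split("\n")


segment = [
    "void stonesoup_handle_taint(char*champion_adamello)",
    "char*stonesoup_buffer=0;",
    "char*royalising_resaw=0;",
    "stonesoup_buffer=malloc((strlen(royalising_resaw)+1)*sizeof(char));",
]
symbolized = symbolize(segment)
print("\n".join(symbolized))
```

The sketch ignores character literals, comments and function pointers, which a parser-based implementation would handle.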

In step S421, the second step is vectorization. A machine learning algorithm can only accept input in the form of vectors, so a word vector model is chosen to map each symbol to an integer, which is then converted into a vector of fixed length. Since feature code segments may contain different numbers of symbols, the corresponding vectors may have different lengths. A vector that is too long is truncated at the beginning, and a vector that is too short is padded with 0 at the beginning. Through vectorization, the processing of text content is reduced to vector operations in a vector space, and similarity in the vector space expresses similarity in text semantics.

Vectorizing the symbolized feature code segment comprises: splitting the feature code segment containing the first symbol names and the second symbol names into a symbol sequence; and selecting a word vector model to map each symbol in the symbol sequence to an integer, which is then converted into a vector of a predetermined length.
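The splitting and integer-mapping part of this step might look like the following sketch. The tokenizer regular expression and the first-seen numbering are assumptions; the description does not specify how the word vector model assigns integers. Here 0 is reserved so that it stays free for padding.

```python
import re

# Strings, identifiers, numbers, a few two-character operators, then any single symbol.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|\d+|==|!=|>=|<=|&&|\|\||\S')


def tokenize(lines):
    """Split symbolized code lines into one flat symbol sequence."""
    return [tok for line in lines for tok in TOKEN.findall(line)]


def build_vocab(token_sequences):
    """Assign each distinct symbol an integer in order of first appearance (0 is reserved)."""
    vocab = {}
    for seq in token_sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def encode(tokens, vocab):
    return [vocab[tok] for tok in tokens]


tokens = tokenize(["void FUN1(char*VAR1)", "char*VAR2=0;"])
vocab = build_vocab([tokens])
ids = encode(tokens, vocab)
print(tokens)
print(ids)
```

Because the vocabulary is shared across segments, identical symbols always receive identical integers, which is what makes the resulting vectors comparable.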

It should be noted that packaging each feature code segment into a vector after symbolization and vectorization, as in step S421, is a preferred embodiment. A feature code segment can also be vectorized directly and input into the neural network model, but feeding vectors that have not been symbolized may degrade the training of the neural network model; symbolizing the feature code segment first reduces this adverse effect.

S422, constructing a loss function from the predicted value and the true value of whether the sensitive point of the i-th sample has a security defect, minimizing the loss function, and updating the weight parameters of the network to obtain the trained neural network model.

The true value of whether the sensitive point of each sample has a security defect can be determined in advance by examining the feature code segment of the sample. For data sets such as SARD and OWASP, which directly provide source code together with labels, the true value of the sample is set to "1" if a line contained in the feature code segment is marked as defective in the data set or if the source code that produced the feature code segment is marked as defective, and to "0" otherwise. For a data set without direct labels, such as NVD, the true value of the sample is set to "1" if the feature code segment contains at least one statement deleted or modified by the patch, which indicates that the sensitive point of the feature code segment has a security defect, and to "0" otherwise. When the predicted values are closest to the true values, the loss function attains its minimum, the network weight parameters obtained are optimal, and model training is complete. The trained neural network model is used in the subsequent step 43 to determine, for the feature code segment of an unknown sample, whether the sensitive point corresponding to the feature code segment has a security defect.
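The two labeling rules can be written down directly. The function names and the representation of a segment as a list of source line numbers are assumptions for this sketch; the line numbers in the example are invented.

```python
def label_from_marked_lines(segment_lines, defective_lines, file_is_defective=False):
    """Labeled data sets (SARD/OWASP style): 1 if any segment line is marked
    defective, or if the source file that produced the segment is marked defective."""
    return 1 if file_is_defective or set(segment_lines) & set(defective_lines) else 0


def label_from_patch(segment_lines, patched_lines):
    """Unlabeled data sets (NVD style): 1 if the segment contains at least one
    line that the patch deleted or modified."""
    return 1 if set(segment_lines) & set(patched_lines) else 0


# Hypothetical line numbers, for illustration only.
print(label_from_marked_lines([3, 5, 6, 24, 28], defective_lines=[28]))
print(label_from_patch([3, 5, 6], patched_lines=[10, 11]))
```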

The loss function is minimized by gradient descent: the gradient of the loss function is computed, and the weights are updated along the direction of steepest descent. After training, the neural network model corresponds to a set of optimal network weight parameters.
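To make the training loop concrete, the sketch below minimizes a binary cross-entropy loss by batch gradient descent. The description names neither the loss function nor the network architecture, so both are assumptions here: a single sigmoid unit stands in for the neural network, and the four two-dimensional samples are invented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Predicted probability that the sensitive point has a defect."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def bce_loss(weights, bias, samples, labels):
    """Mean binary cross-entropy between predictions and true values."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(samples, labels):
        p = predict(weights, bias, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(samples)


def train(samples, labels, lr=0.5, epochs=200):
    """Batch gradient descent: step against the gradient of the loss."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(samples, labels):
            error = predict(weights, bias, x) - y  # d(loss)/d(logit) for this sample
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Invented toy data: label 1 means "defective".
samples = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.9], [0.9, 0.1]]
labels = [1, 0, 1, 0]

initial_loss = bce_loss([0.0, 0.0], 0.0, samples, labels)
weights, bias = train(samples, labels)
final_loss = bce_loss(weights, bias, samples, labels)
print(initial_loss, final_loss)
```

In a real deep learning setup the same loop is carried out by a framework's optimizer over the vectors produced in step S421.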

Step 43, inputting a feature code segment containing source code sensitive point feature information of an unknown sample into the trained neural network model to determine whether the sensitive point corresponding to the feature code segment has a security defect.

Further, between step 42 and step 43, after the neural network model is trained by the deep learning method, the method further includes, in order to optimize the accuracy of the neural network model: evaluating the performance of the trained model until the performance requirement is met.
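The description does not say which metrics make up the performance requirement. As one plausible reading, the sketch below computes the false alarm rate and the miss rate on a held-out labeled set and checks them against thresholds; the function names, the metric choice and the threshold values are all assumptions.

```python
def evaluate(predictions, truths):
    """Compute false positive rate, false negative rate and accuracy for 0/1 labels."""
    pairs = list(zip(predictions, truths))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)
    return {
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


def meets_requirement(metrics, max_fpr, max_fnr):
    return (metrics["false_positive_rate"] <= max_fpr
            and metrics["false_negative_rate"] <= max_fnr)


metrics = evaluate([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(metrics, meets_requirement(metrics, max_fpr=0.1, max_fnr=0.1))
```

If the requirement is not met, training would be repeated, for example with more samples or different hyperparameters, and the model evaluated again.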

It should be noted that the feature code segments of the known samples in step 41 and of the unknown sample in step 43 are obtained according to the first embodiment. Furthermore, the inputs to the neural network model in steps 42 and 43 are vectors formed from the feature code segments.

In summary, through the training process, the neural network model of this embodiment can be updated dynamically as the data set of known samples grows, so that detection efficiency and accuracy improve continuously. After an unknown sample is input into the trained neural network model, an accurate detection result of whether the sensitive point of the unknown sample has a security defect is obtained, which achieves the purpose of the method.

EXAMPLE IV

Building on the third embodiment, a specific scenario is given below to illustrate the invention clearly. The feature code segment of the 6th sensitive point shown in Table 4 is taken as an example and is symbolized and vectorized. The feature code segment is shown in Table 5.

void stonesoup_handle_taint(char*champion_adamello)

char*stonesoup_buffer=0;

char*royalising_resaw=0;

if(champion_adamello!=0){

royalising_resaw=((char*)champion_adamello);

stonesoup_buffer=malloc((strlen(royalising_resaw)+1)*sizeof(char));

if(stonesoup_buffer==0){

strcpy(stonesoup_buffer,royalising_resaw);

if(stonesoup_buffer[0]>=97){

stonesoup_printf("Index of first char: %i\n",stonesoup_process_buffer(stonesoup_buffer));

char stonesoup_process_buffer(char*buffer_param){

first_char=buffer_param[0]-97;

free(buffer_param);

if(stonesoup_buffer!=0){

free(stonesoup_buffer);

TABLE 5

First, symbolization is performed.

In the feature code segment, champion_adamello, stonesoup_buffer, royalising_resaw, buffer_param and first_char are user-defined variable names, which are mapped one-to-one to the symbols VAR1, VAR2, VAR3, VAR4 and VAR5, as shown in Table 6.

void stonesoup_handle_taint(char*VAR1)

char*VAR2=0;

char*VAR3=0;

if(VAR1!=0){

VAR3=((char*)VAR1);

VAR2=malloc((strlen(VAR3)+1)*sizeof(char));

if(VAR2==0){

strcpy(VAR2,VAR3);

if(VAR2[0]>=97){

stonesoup_printf("Index of first char: %i\n",stonesoup_process_buffer(VAR2));

char stonesoup_process_buffer(char*VAR4){

VAR5=VAR4[0]-97;

free(VAR4);

if(VAR2!=0){

free(VAR2);

TABLE 6

In the feature code segment, stonesoup_handle_taint, stonesoup_printf and stonesoup_process_buffer are user-defined functions, and these function names are mapped one-to-one to the symbols FUN1, FUN2 and FUN3, as shown in Table 7.

void FUN1(char*VAR1)

char*VAR2=0;

char*VAR3=0;

if(VAR1!=0){

VAR3=((char*)VAR1);

VAR2=malloc((strlen(VAR3)+1)*sizeof(char));

if(VAR2==0){

strcpy(VAR2,VAR3);

if(VAR2[0]>=97){

FUN2("Index of first char: %i\n",FUN3(VAR2));

char FUN3(char*VAR4){

VAR5=VAR4[0]-97;

free(VAR4);

if(VAR2!=0){

free(VAR2);

TABLE 7

The symbolized feature code segment is then vectorized.

The symbolized feature code segment is split into individual symbols. Taking the first two lines as an example:

void FUN1(char*VAR1)

char*VAR2=0;

These lines are split into: "void", "FUN1", "(", "char", "*", "VAR1", ")", "char", "*", "VAR2", "=", "0", ";".

The following lines are handled similarly. The symbolized feature code segment shown in Table 7 is decomposed into a symbol sequence of 137 symbols.

A word vector model is selected, each symbol in the symbol sequence is mapped to an integer, and the integers are converted into a vector of fixed length. For the first two lines, the result of the mapping is: 38, 15, 0, 12, 6, 4, 1, 12, 6, 7, 5, 14, 2.

The following lines are handled similarly, and identical symbols are mapped to identical numbers. For example, 12 and 6 each occur twice because they are the mappings of the symbols "char" and "*", respectively. The entire symbol sequence is converted into a numeric vector containing 137 numbers.

If the machine learning model happens to require a vector of length 137 as input, the number sequence needs no further processing. If the model requires a vector of length 130, the number sequence is too long and is truncated from the beginning, i.e. the first seven numbers are deleted and the sequence becomes 12, 6, 7, 5, 14, 2, …. If the model requires a vector of length 140, the number sequence is too short and is padded with 0 at the beginning, i.e. the sequence becomes 0, 0, 0, 38, 15, 0, 12, 6, 4, 1, 12, 6, 7, 5, 14, 2, ….
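The length adjustment can be sketched as a single helper. Only the first 13 of the 137 numbers are given in the text, so the example applies the same two operations to that prefix: removing seven numbers from the front, and prepending three zeros. The name `fit_length` is an assumption.

```python
def fit_length(ids, length, pad_id=0):
    """Truncate at the beginning if too long, pad with pad_id at the beginning if too short."""
    if len(ids) >= length:
        return ids[len(ids) - length:]
    return [pad_id] * (length - len(ids)) + ids


# Integers for the first two lines, as given in the description.
first_two_lines = [38, 15, 0, 12, 6, 4, 1, 12, 6, 7, 5, 14, 2]

truncated = fit_length(first_two_lines, 6)   # drops the first seven numbers
padded = fit_length(first_two_lines, 16)     # prepends three zeros
print(truncated)
print(padded)
```

Truncating and padding at the front both keep the end of the sequence intact.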

The resulting numeric vector is input into the machine learning model, i.e. the neural network model. Its physical meaning is the mapping of the text content into a vector space, with one vector representing one feature code segment. Vectorization reduces operations on text information such as feature code segments to operations on vectors.

EXAMPLE V

Based on the same inventive concept as the above embodiments, the invention also discloses a source code defect detection device, whose schematic structural diagram is shown in fig. 5. The device comprises:

the acquisition module 501 is used for acquiring a feature code segment containing feature information of a source code sensitive point of a known sample as training data;

the training module 502 is used for training a neural network model by a deep learning method according to the acquired feature code segments to obtain the trained neural network model, the trained neural network model being used for judging whether a sensitive point has a security defect;

the detection module 503 is used for inputting a feature code segment containing source code sensitive point feature information of an unknown sample into the trained neural network model to determine whether the sensitive point corresponding to the feature code segment has a security defect.

The acquisition of the feature code segment comprises: performing data flow analysis according to each sensitive point in the source code file to obtain a feature code segment containing source code sensitive point feature information.

Performing data flow analysis according to each sensitive point in the source code file to obtain a feature code segment containing source code sensitive point feature information specifically includes:

determining each sensitive point in a source code file;

for each sensitive point, tracking the data flow of the initial sensitive variable of the sensitive point and acquiring other sensitive variables, to obtain the set of code lines semantically related to each sensitive variable;

arranging the acquired code line sets according to the sequence in the source code file to obtain a feature code segment containing feature information of the source code sensitive points;

the sensitive point comprises a sensitive function call, an array sensitive operation or a pointer sensitive operation; the sensitive function comprises a library function or an API function; array-sensitive operations include access to or assignment of array elements; pointer-sensitive operations include access to or assignment of a pointer;

when the sensitive point is a pointer sensitive operation, the initial sensitive variable is the pointer.

Before each sensitive point in the source code file is determined, the source code file is preprocessed, including normalizing the source code file and expanding macro definitions;

after arranging the acquired code line sets according to the sequence in the source code file, post-processing the source code file to acquire the feature code segment number, the source code file path and the line number of the sensitive point.

The determining each sensitive point in the source code file comprises:

generating an abstract syntax tree from the source code file;

analyzing the abstract syntax tree, and recording the position of each sensitive point in a source code file to form a sensitive point list; and storing the function name and the parameter called by the user-defined function, the function name and the parameter of the user-defined function and the calling relation among the functions.
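The description locates sensitive points by analyzing an abstract syntax tree. The sketch below swaps in a much cruder technique, a line-by-line regular-expression scan, purely to show the shape of the resulting sensitive point list. The set of sensitive functions is a hypothetical example, pointer operations are left out because recognizing them needs type information from a parser, and array declarations are indistinguishable from array accesses at this level.

```python
import re

# Hypothetical list of sensitive library/API functions.
SENSITIVE_FUNCS = {"malloc", "free", "strcpy", "strlen", "memcpy"}

CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(")
ARRAY = re.compile(r"\b([A-Za-z_]\w*)\s*\[")


def find_sensitive_points(source_lines):
    """Return (line_number, kind, name) for each sensitive point found."""
    points = []
    for line_no, line in enumerate(source_lines, start=1):
        for name in CALL.findall(line):
            if name in SENSITIVE_FUNCS:
                points.append((line_no, "function_call", name))
        for name in ARRAY.findall(line):
            points.append((line_no, "array_access", name))
    return points


source = [
    "char *buf = malloc(10);",
    "buf[0] = 'a';",
    "free(buf);",
]
points = find_sensitive_points(source)
print(points)
```

A parser-based version would walk call, subscript and dereference nodes of the tree instead and would also record the call relations between functions.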

For each sensitive point, tracking the data flow of the initial sensitive variable of the sensitive point, acquiring other sensitive variables, and obtaining the set of code lines semantically related to each sensitive variable specifically comprises:

traversing the sensitive point list in a loop, acquiring the initial sensitive variable of each sensitive point, adding the initial sensitive variable to the sensitive variable list, tracking the data flow of the initial sensitive variable by analyzing the abstract syntax tree, and recording the set of code lines semantically related to the initial sensitive variable;

iteratively tracking other sensitive variables by taking the initial sensitive variable as a starting point, and in the process of tracking the data stream of the initial sensitive variable,

or determining a second sensitive variable and adding the second sensitive variable into a sensitive variable list according to the calling relation among the functions, the function name and the parameter called by the user-defined function, and the function name and the parameter of the user-defined function, tracking the data stream of the second sensitive variable by analyzing the abstract syntax tree, and recording a code line set semantically related to the second sensitive variable.
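The iterative tracking can be pictured as a worklist loop, sketched here on plain text lines instead of an abstract syntax tree. The propagation rule is an assumption made for this example: when a sensitive variable is the target of an assignment, the identifiers on the right-hand side become sensitive as well. The code lines, the `IGNORED` set and the function name `track` are invented for illustration.

```python
import re

IDENT = re.compile(r"[A-Za-z_]\w*")
ASSIGN_OP = re.compile(r"(?<![=!<>])=(?!=)")  # a single "=", not ==, !=, <=, >=
# Hypothetical set of keywords and library names that never become sensitive variables.
IGNORED = {"char", "int", "void", "if", "sizeof", "malloc", "strlen", "strcpy", "free"}


def track(lines, initial_var):
    """lines: {line_number: text}. Returns (related line numbers, sensitive variables)."""
    sensitive, worklist, related = {initial_var}, [initial_var], set()
    while worklist:
        var = worklist.pop()
        for line_no, text in lines.items():
            if var not in IDENT.findall(text):
                continue
            related.add(line_no)
            assign = ASSIGN_OP.search(text)
            if assign is None:
                continue
            lhs = IDENT.findall(text[:assign.start()])
            if lhs and lhs[-1] == var:
                # The sensitive variable is assigned here: follow the right-hand side.
                for name in IDENT.findall(text[assign.end():]):
                    if name not in IGNORED and name not in sensitive:
                        sensitive.add(name)
                        worklist.append(name)
    return sorted(related), sensitive


code = {
    1: "char *src = 0;",
    2: "char *buf = 0;",
    3: "src = input;",
    4: "buf = malloc(strlen(src) + 1);",
    5: "int unrelated = 7;",
    6: "free(buf);",
}
related_lines, sensitive_vars = track(code, "buf")
print(related_lines, sorted(sensitive_vars))
```

Line 5 is never collected because none of its identifiers is reached from `buf`.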

According to the calling relation among the functions, the function name and the parameter called by the user-defined function, and the function name and the parameter of the user-defined function, determining a second sensitive variable, which specifically comprises the following steps:

The training module 502 trains the neural network model by a deep learning method according to the acquired feature code segments to obtain the trained neural network model, and is specifically configured for:

symbolizing and vectorizing each feature code segment, packaging it into a vector, and inputting the vector into the neural network model as the i-th sample of the M samples of the deep learning training set to obtain a predicted value of whether the sensitive point has a security defect, where 1 ≤ i ≤ M and M is a natural number;

constructing a loss function from the predicted value and the true value of whether the sensitive point of the i-th sample has a security defect, minimizing the loss function, and updating the weight parameters of the network to obtain the trained neural network model.

The training module 502, symbolizing the feature code segment, includes: mapping each user-defined variable in the feature code segment to a first symbol name one by one; mapping each user-defined function in the feature code segment to a second symbol name one by one;

The training module 502, after training the neural network model by using a deep learning method, is further configured to: and performing performance evaluation on the trained model until the performance requirement is met.

In conclusion, the beneficial effects of the invention are as follows:

First, rules need not be written manually

Existing detection methods and tools usually require professional practitioners to analyze the various patterns of code defects and write corresponding detection rules. Rules defined by human experts are far from covering all standards and vulnerability types, and the emergence of new vulnerability types requires human experts to define new detection rules. With the method provided by the invention, only the code sensitive points need to be summarized manually; the subsequent feature extraction steps are completed automatically, without investing a large amount of manpower in writing rules. In actual use, a good balance can be achieved among factors such as manpower, computation, speed and effectiveness.

Second, the false alarm rate is low

Existing static source code analysis tools suffer from a high false alarm rate or a high miss rate because of the limitations of static analysis technology and the imprecision of the analysis. Fully refined analyses such as directional analysis, guard value analysis and non-guard value analysis are technically difficult to implement and often have high time and space complexity. With a detection tool based on artificial intelligence, the model can learn the context information of defective code, so the problems of high false alarm rate and high miss rate can be better addressed. Moreover, once model training is complete, detection is more efficient, which saves developers the effort of investigating false alarms and improves the usability of the tool in actual production.

Third, the detection effect can be continuously improved

In existing methods and tools, once the rules have been defined the detection effect is fixed unless new rules are written manually. With the method based on artificial intelligence, the trained model is derived from the constructed samples and is able to detect the vulnerability patterns present in all training samples. As samples continue to accumulate, the set of vulnerability patterns can be updated continuously and automatically in an iterative manner, and the detection capability of the trained model improves accordingly.

Fourth, new defect patterns can be discovered

Because their detection rules are all defined manually, existing detection tools can only detect what is already known. If the detection rules themselves are not rigorous, some defect patterns are likely to be missed. An artificial intelligence algorithm is not limited by strict rule matching, so it becomes possible to discover patterns that are similar to but different from known defect patterns. If a new form of code defect appears, a corresponding detection model can be trained as soon as training samples are obtained, without requiring experts to invest a great deal of effort in research.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for detecting a defect in a source code, the method comprising:

inputting a feature code segment containing source code sensitive point feature information of an unknown sample into a trained neural network model to determine whether a sensitive point corresponding to the feature code segment has a safety defect;

the method for acquiring the feature code fragment comprises the following steps: performing data stream analysis according to each sensitive point in the source code file to obtain a feature code segment containing feature information of the sensitive point of the source code;

determining each sensitive point in a source code file;

the sensitive point comprises a sensitive function call, an array sensitive operation or a pointer sensitive operation; the sensitive function comprises a library function or an API function; array-sensitive operations include access to or assignment of array elements; pointer sensitive operations include access to or assignment of a pointer;

when the sensitive point is a pointer sensitive operation, the initial sensitive variable is the pointer;

the determining each sensitive point in the source code file comprises:

generating an abstract syntax tree from the source code file;

analyzing the abstract syntax tree, and recording the position of each sensitive point in a source code file to form a sensitive point list; storing the function name and parameter called by the user-defined function, the function name and parameter of the user-defined function and the calling relation among the functions;

2. The method of claim 1,

before determining each sensitive point in the source code file, preprocessing the source code file, including normalizing the source code file and expanding macro definitions;

after arranging the acquired code line sets according to the sequence in the source code file, the method further comprises the step of carrying out post-processing on the source code file to acquire the feature code segment number, the source code file path and the line number of the sensitive point.

3. The method of claim 2, wherein determining the second sensitive variable according to the calling relationship among the functions, the function name and parameter called by the user-defined function, and the function name and parameter of the user-defined function comprises:

obtaining the function called by the user-defined function to be the same as the user-defined function according to the calling relation among the functions;

4. The method of claim 1, wherein the training of the neural network model by a deep learning method according to the obtained feature code segments to obtain the trained neural network model specifically comprises:

performing symbolization and vectorization on each feature code segment, packaging the feature code segment into a vector, inputting the vector as the ith sample in M samples of a deep learning training set into a neural network model, and obtaining a predicted value of whether a sensitive point has a safety defect; i belongs to M, and M is a natural number;

5. The method of claim 4,

symbolizing the feature code segment includes: mapping each user-defined variable in the feature code segment to a first symbol name one by one; mapping each user-defined function in the feature code segment to a second symbol name one by one;

6. The method of claim 1, wherein after training the neural network model using the deep learning method, the method further comprises:

and performing performance evaluation on the trained model until the performance requirement is met.

7. A source code defect detection apparatus, the apparatus comprising:

the detection module is used for inputting a feature code segment containing source code sensitive point feature information of an unknown sample into the trained neural network model so as to determine whether a sensitive point corresponding to the feature code segment has a safety defect;

the acquisition of the feature code segments in the acquisition module is specifically used for: performing data stream analysis according to each sensitive point in the source code file to obtain a feature code segment containing feature information of the sensitive point of the source code;

the acquisition module performs data stream analysis according to each sensitive point in the source code file to obtain a feature code segment containing feature information of the sensitive point of the source code, and is specifically used for:

determining each sensitive point in a source code file;

the sensitive points comprise sensitive function calls, array sensitive operations or pointer sensitive operations; wherein the sensitive function comprises a library function or an API function; array sensitive operations include access to or assignment of array elements; pointer sensitive operations include access to or assignment of a pointer;

the acquisition module determines each sensitive point in the source code file, and is specifically configured to:

generating an abstract syntax tree from the source code file;

for each sensitive point, the acquisition module tracks the data stream of the initial sensitive variable according to the initial sensitive variable of each sensitive point, acquires other sensitive variables, and obtains a code line set semantically related to each sensitive variable, and is specifically used for: