CN108427643B

CN108427643B - Binary program fuzz testing method based on multi-population genetic algorithm

Info

Publication number: CN108427643B
Application number: CN201810233482.3A
Authority: CN
Inventors: 罗森林; 侯留洋; 潘丽敏; 焦龙龙; 张笈
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2018-03-21
Filing date: 2018-03-21
Publication date: 2020-12-08
Anticipated expiration: 2038-03-21
Also published as: CN108427643A

Abstract

The invention relates to a binary program fuzz testing method based on a multi-population genetic algorithm, belonging to the field of binary vulnerability mining in information security. The method first abstracts each test data individual as a chromosome. A main population and sub-populations 1 and 2 are then initialized, either randomly or from initial data. Fitness is measured by recording the number of newly discovered edges in the execution path of the test data and the number of edges associated with the test data. Next, the superior individuals of each sub-population are selected by fitness ranking and migrated to the main population. Finally, the main population and the sub-populations each undergo genetic operations (crossover and mutation) to produce new individuals for a new round of traced execution. The method effectively improves program execution path coverage, can cover specific program execution paths, provides clear guidance for test data generation, and has good application and popularization value.

Description

Binary program fuzz testing method based on multi-population genetic algorithm

Technical Field

The invention relates to a binary program fuzz testing method and belongs to the field of binary vulnerability mining in information security.

Background

Fuzz testing is currently the most common vulnerability mining method in the security field and gives good overall results. It supplies randomly constructed or mutated test cases to a target software system and monitors whether execution exhibits anomalies such as crashes, in order to determine whether the target software has potential vulnerabilities. The higher the code coverage of the test data generated by a fuzz testing system, the higher the likelihood of finding a vulnerability, so code coverage can serve as an evaluation criterion for the quality of test data generation. In fuzz testing, the source code of the program under test is generally not available, so the format of the input data is unknown. The mutation-based approach generates new test data by directly modifying existing test data. However, because the mutation is random, high code coverage cannot be achieved and vulnerability mining is not effective. The present invention therefore provides a binary program fuzz testing method based on a multi-population genetic algorithm to improve the code coverage of test data generated by mutation-based fuzz testing.

The basic problem to be solved by the binary program fuzz testing method based on a multi-population genetic algorithm is that test data randomly generated by the mutation-based method cannot achieve high code coverage. Existing binary program fuzz testing methods for unknown input data formats can be classified into two types:

1. Symbolic execution methods

Symbolic execution methods treat the test data as symbolic values. They collect constraint information as the program processes the symbolic values, then solve the constraints to generate new test data that exercises a new execution path. In theory the method can reach 100% code coverage, but for complex programs symbolic execution suffers from path explosion, which severely limits its range of application.

2. Evolutionary algorithm methods

Methods based on evolutionary algorithms convert the test data into a suitable representation in order to guide test data generation; among them, genetic algorithms are widely used. At present, using a genetic algorithm requires one or more execution paths to be predefined manually, so the generated test data conforms only to the preset execution paths and other paths cannot be tested.

In summary, existing binary program fuzz testing methods for unknown input data formats are unsuitable for complex programs, test few execution paths, and have other such problems. The invention therefore provides a binary program fuzz testing method based on a multi-population genetic algorithm.

Disclosure of Invention

The invention aims to provide a binary program fuzz testing method based on a multi-population genetic algorithm, in order to solve the problems that existing binary program fuzz testing methods for unknown input data formats are unsuitable for complex programs and test few execution paths.

The design principle of the invention is as follows:

Test data are converted into individuals of a main population and sub-populations, and new test data are generated from the changes the individuals undergo during evolution. At the same time, new individuals from the sub-populations influence the evolution of the main population: a migration operator ensures that good information is shared and exchanged among the populations. This accelerates convergence, that is, it speeds up the growth of program execution path coverage, while preserving the diversity of the individuals in the populations. The overall flow is shown in Fig. 1.

The technical scheme of the invention is realized by the following steps:

Step 1: population initialization.

Step 1.1: test data are converted into individuals in the population.

Step 1.2: the main population and the sub-populations are randomly initialized.

Step 2: basic block location.

Step 2.1: the program is executed under Qemu instrumentation, and basic block information is acquired during program execution.

Step 2.2: code is inserted before each basic block to output the program execution path information to an external file.

Step 3: monitor whether the program under test crashes and record the program execution path.

Step 3.1: the basic block sequence is recorded after the program is executed. The basic block sequence can be converted into an edge sequence, where an edge is a jump between two consecutive basic blocks.

Step 3.2: identical edges in the edge sequence are merged to obtain an edge set, which serves as the program execution path information.

Step 3.3: all individuals $X_i$ are processed according to steps 3.1 and 3.2 to obtain the program execution path information corresponding to all test data, i.e. the edge sets $E_i$.

Step 4: calculate the fitness of the test data and select the superior individuals of the population.

Step 4.1: the increment in the number of discovered edges after the test data is executed is calculated as the $f_1$ value for the fitness calculation.

Step 4.2: the set of all edges discovered so far is updated, and the number of edges associated with the test data is calculated as the $f_2$ value for the fitness calculation.

Step 4.3: fitness ranking compares $f_1$ values first and then $f_2$ values, and the superior individuals (test data) are selected.

Step 5: migration from the sub-populations to the main population, and crossover and mutation within each population.

Step 5.1: a suitable number of superior individuals are migrated from the sub-populations to the main population.

Step 5.2: crossover within each population.

Step 5.3: mutation within each population.

Step 5.4: the newly obtained superior individuals of the main population and the sub-populations are fed to the program under test for execution, and steps 3 to 5 are repeated.

Advantageous effects

Compared with symbolic execution analysis and other evolutionary algorithms, test data generated by the invention in the same amount of time achieves higher code coverage, which shows that the multi-population genetic algorithm provides clear guidance for test data generation in fuzz testing. The efficiency of finding crash vulnerabilities is significantly improved compared with AFL, and the generated test data can cover specific program execution paths.

Drawings

Fig. 1 is a flow chart of the binary program fuzz testing method based on a multi-population genetic algorithm of the present invention.

Fig. 2 is a schematic diagram illustrating an exemplary basic block of the present invention.

Detailed Description

In order to better illustrate the objects and advantages of the present invention, embodiments of the method of the present invention are described in further detail below with reference to examples.

The specific process is as follows:

Step 1: population initialization.

Step 1.1: the main population and the sub-populations each consist of a number of individuals, each of which can be abstractly expressed as a chromosome. The $i$-th individual in a population can be expressed as $X_i = (x_{i,1}, x_{i,2}, x_{i,3}, \ldots, x_{i,D})$. Population initialization assigns a value to each gene $x_{i,d}$ of $X_i$. Each $x_{i,d}$ represents one byte, and the length $D$ is the number of bytes of the test data.

Step 1.2: the invention initializes the main population and the sub-populations by random assignment. Two sub-populations, class1 and class2, are used; the class2 sub-population is initialized by bitwise negation of the binary code of each individual in the class1 sub-population.
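The initialization of step 1 can be illustrated with a minimal Python sketch. The function names, the population size and the fixed seed are assumptions made for this example, and deriving class2 by flipping every bit of class1 is one reading of the negation described in step 1.2.

```python
import random


def random_individual(length, rng):
    """One chromosome X_i: `length` genes, each gene being one byte (0-255)."""
    return bytes(rng.randrange(256) for _ in range(length))


def negate(individual):
    """Bitwise negation of every byte of an individual."""
    return bytes(b ^ 0xFF for b in individual)


def init_populations(size, length, seed=0):
    """Return (main, class1, class2), each holding `size` individuals of D = `length` bytes."""
    rng = random.Random(seed)
    main = [random_individual(length, rng) for _ in range(size)]
    class1 = [random_individual(length, rng) for _ in range(size)]
    # Assumption: class2 is obtained by negating each class1 individual bit by bit.
    class2 = [negate(ind) for ind in class1]
    return main, class1, class2


main, class1, class2 = init_populations(size=4, length=8)
```

Starting class2 from the complement of class1 places the two sub-populations far apart in the input space, which is consistent with the stated goal of keeping the populations diverse.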

Step 2: basic block location.

Step 2.1: Qemu instrumentation can acquire basic block information during program execution. Qemu emulates the execution of the program, dividing it into basic blocks for translation and execution.

Step 2.2: a piece of code that outputs information about the currently executing basic block is inserted before Qemu executes each basic block. When Qemu emulates the program, this yields the basic block sequence corresponding to the execution, i.e. the execution path information of the program, which is recorded to an external file.

Step 3: traced execution of the program under test.

Step 3.1: each basic block in the program is represented by its entry address $b$, and tracing the program execution yields a basic block sequence $(b_1, b_2, b_3, \ldots, b_n)$. A jump between two consecutive basic blocks in the execution path is defined as $e = (b_k, b_{k+1})$; $e$ is then an edge in the program execution path, with basic blocks as nodes (as shown in Fig. 2), and the program execution path can be represented as an edge sequence $E_e = (e_1, e_2, e_3, \ldots, e_{n-1})$.
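As a small illustration of step 3.1, the following Python sketch converts a basic block sequence into its edge sequence. The function name and the example addresses are made up for the illustration; a real trace would come from the instrumented Qemu run.

```python
def to_edges(blocks):
    """Turn a basic block sequence (b_1..b_n) into the edge sequence (e_1..e_{n-1})."""
    # Pair each block with its successor: e_k = (b_k, b_{k+1}).
    return list(zip(blocks, blocks[1:]))


# Hypothetical entry addresses of the basic blocks visited in one run.
trace = [0x400000, 0x400010, 0x400030, 0x400010, 0x400030, 0x400050]
edges = to_edges(trace)
```

A trace of n blocks yields n-1 edges, and a loop such as 0x400010 -> 0x400030 appears in the sequence once per iteration.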

Step 3.2: identical edges in the sequence $E_e$ are merged to obtain a set of edges that carries occurrence-count information, $E'_e = (e'_1, e'_2, e'_3, \ldots, e'_{n-1})$. The number of occurrences of the same edge may differ between program executions, so it is divided into 8 classes: 1, 2, 3, 4-7, 8-15, 16-31, 32-127, and 128 or more. These 8 classes can be represented by the different bits of one byte, which simplifies implementation. After classification, a new set of edges $E''_e = \{e''_1, e''_2, e''_3, \ldots, e''_{n-1}\}$ is obtained.
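The merging and classification of step 3.2 could be sketched in Python as follows. The function names are hypothetical, and the assignment of class k to bit k of the byte is an assumption: the text only says that the 8 classes use different bits of one byte.

```python
from collections import Counter

# Inclusive upper bound of the first seven count classes; the eighth (>= 128) is open-ended.
BUCKET_UPPER = [1, 2, 3, 7, 15, 31, 127]


def bucket_bit(count):
    """Map an occurrence count (>= 1) to a byte with exactly one bit set."""
    if count < 1:
        raise ValueError("an edge in the set occurs at least once")
    for bit, upper in enumerate(BUCKET_UPPER):
        if count <= upper:
            return 1 << bit
    return 1 << 7


def classify_edges(edge_sequence):
    """Merge identical edges and tag each with its occurrence-count class."""
    counts = Counter(edge_sequence)
    return {(edge, bucket_bit(n)) for edge, n in counts.items()}


sequence = [("a", "b")] * 5 + [("b", "c")]
edge_set = classify_edges(sequence)
```

Because the class is part of each set element, the same edge taken 5 times and 20 times counts as two different elements, so a change in loop behaviour registers as new coverage.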

Step 3.3: for each individual $X_i$ in the main population and the sub-populations, the corresponding basic block sequence is processed as described above, finally yielding the execution path information of the program, i.e. the edge set $E_i = \{e_{i,1}, e_{i,2}, e_{i,3}, \ldots\}$.

Step 4: calculate and rank fitness, then select the superior individuals of each population.

Step 4.1: the set of all edges discovered during the whole fuzz testing process is defined as $E_t = \{e_{t,1}, e_{t,2}, e_{t,3}, \ldots\}$. The number $f_1$ of edges newly discovered when the test data is executed in the program under test is calculated as $f_1(X_i) = \mathrm{card}(E_i - E_t)$, where $\mathrm{card}(A)$ is the number of elements in the set $A$.
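The $f_1$ value is a set difference followed by a count, which a short Python sketch makes concrete. The function name and the example edge sets are invented for the illustration.

```python
def newly_found_edges(edge_set, discovered):
    """f1(X_i) = card(E_i - E_t): edges of this run that no earlier run has found."""
    return len(set(edge_set) - set(discovered))


E_t = {(1, 2), (2, 3)}            # edges discovered so far
E_i = {(2, 3), (3, 4), (4, 5)}    # edges covered by individual X_i
f1 = newly_found_edges(E_i, E_t)  # (3, 4) and (4, 5) are new
```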

Step 4.2: update the sets $E_t$ and $W_t$ of the population. For any edge $e_{t,i}$ in the set $E_t$, let the last test data to find this edge be $X_{t,i}$; this gives a set $W_t = \{(e_{t,1}, X_{t,1}), (e_{t,2}, X_{t,2}), (e_{t,3}, X_{t,3}), \ldots\}$ that maps each edge to its test data. The function

$$f_2(X_i) = \sum_{e \in E_t} R(W(e), X_i)$$

is then used to compute $f_2$, the number of edges in the set $W_t$ that are associated with the test data $X_i$. Here $W(e)$ returns the test data corresponding to edge $e$ in the set $W_t$, and $R(x, y)$ is a binary function that returns 1 when $x$ and $y$ are the same and 0 otherwise.

Step 4.3: the fitness values and ranking of each sub-population are computed independently of the main population. First, $f_1$ of each individual in the population is calculated; then the sets $E_t$ and $W_t$ of the population are updated; finally, $f_2$ of each individual is calculated. When the fitness of two individuals is compared, their $f_1$ values are compared first, and their $f_2$ values are compared only if the $f_1$ values do not distinguish them. In this way the superior individuals are selected from the main population and the sub-populations.
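The order of operations in step 4.3 can be sketched in Python as below. This is an illustration under stated assumptions rather than the patented implementation: the function name is hypothetical, $W_t$ is modelled as a dictionary from edge to test data, and "the last test data to find this edge" is read as the most recently processed individual whose run covers the edge.

```python
from collections import Counter


def rank_population(individuals, edge_sets, discovered, owner):
    """Rank individuals by (f1, f2), best first.

    individuals: list of test inputs (bytes)
    edge_sets:   list of edge sets E_i, aligned with `individuals`
    discovered:  set E_t of all edges found so far (updated in place)
    owner:       dict W_t mapping edge -> test data (updated in place)
    """
    # f1 is computed for every individual before E_t changes.
    f1 = [len(edges - discovered) for edges in edge_sets]
    # Update E_t and W_t; a later individual overwrites an earlier owner.
    for ind, edges in zip(individuals, edge_sets):
        discovered |= edges
        for e in edges:
            owner[e] = ind
    # f2 counts the edges in W_t whose recorded test data is the individual.
    owned = Counter(owner.values())
    f2 = [owned[ind] for ind in individuals]
    # Compare f1 first; f2 only breaks ties.
    order = sorted(range(len(individuals)), key=lambda i: (f1[i], f2[i]), reverse=True)
    return [individuals[i] for i in order], f1, f2


inputs = [b"A", b"B", b"C"]
runs = [{1, 2}, {2, 3}, {4, 1}]
ranked, f1, f2 = rank_population(inputs, runs, discovered={1}, owner={})
# ranked == [b"B", b"C", b"A"]; f1 == [1, 2, 1]; f2 == [0, 2, 2]
```

In the example, `b"A"` and `b"C"` each find one new edge, so the tie is broken by $f_2$: `b"C"` is recorded for two edges and `b"A"` for none.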

Step 5: migration from the sub-populations to the main population, and crossover and mutation within each population.

Step 5.1: the top 20% of superior individuals in the sub-populations are added to the superior individuals of the main population for crossover and mutation.
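A minimal Python sketch of the migration in step 5.1 follows. The function name is hypothetical, the sub-population is assumed to be already sorted best first, and rounding down with a minimum of one migrant is an assumption, since the text does not say how the 20% is rounded.

```python
def migrate(main_elite, ranked_sub, percent=20):
    """Return the main elite extended with the best `percent` of a ranked sub-population."""
    # Integer arithmetic avoids floating-point surprises; at least one individual migrates.
    k = max(1, len(ranked_sub) * percent // 100)
    return main_elite + ranked_sub[:k]


sub = ["s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9"]  # best first
merged = migrate(["m0", "m1"], sub)
```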

Step 5.2: the crossover process uses 2-opt exchange. For crossover in the main population, where the chromosome length of an individual is $D$, four random numbers between 0 and $D$ are first generated as crossover points, and the chromosome fragments between each pair of crossover points are then exchanged. Similarly, sub-population 1 (class1) and sub-population 2 (class2) use one and three pairs of crossover points, respectively.
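One possible reading of this crossover is sketched below in Python: the crossover points are sorted, grouped into consecutive pairs, and the fragment between each pair is swapped between two parents of equal length. The function name, the `PAIRS` table and this interpretation of "2-opt exchange" are assumptions made for the illustration.

```python
import random

# Pairs of crossover points per population, as stated in step 5.2.
PAIRS = {"main": 2, "class1": 1, "class2": 3}


def crossover(parent_a, parent_b, pairs, rng):
    """Swap `pairs` randomly placed fragments between two equal-length parents."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length D")
    d = len(parent_a)
    child_a, child_b = bytearray(parent_a), bytearray(parent_b)
    # Sorting the points makes the fragments non-overlapping.
    points = sorted(rng.randint(0, d) for _ in range(2 * pairs))
    for lo, hi in zip(points[0::2], points[1::2]):
        child_a[lo:hi], child_b[lo:hi] = child_b[lo:hi], child_a[lo:hi]
    return bytes(child_a), bytes(child_b)


rng = random.Random(0)
a, b = bytes(16), bytes([0xFF]) * 16
child_a, child_b = crossover(a, b, PAIRS["main"], rng)
```

Each position of the two children still holds one byte from each parent, so the operation only recombines existing genes and leaves the introduction of new byte values to mutation.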

Step 5.3: for mutation in the main population, two random numbers between 0 and $D$ are generated as mutation points, and the genes at the mutation points are replaced with randomly generated values. Similarly, sub-population 1 and sub-population 2 randomly generate one and two mutation points, respectively.
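The mutation of step 5.3 can be sketched in Python as follows. The function name and the `MUTATION_POINTS` table are hypothetical; the point counts are copied from the text, and positions are drawn from 0 to D-1 so that every mutation point is a valid byte index.

```python
import random

# Mutation points per population, as stated in step 5.3.
MUTATION_POINTS = {"main": 2, "class1": 1, "class2": 2}


def mutate(individual, points, rng):
    """Replace the genes at `points` random positions with random byte values."""
    child = bytearray(individual)
    if not child:
        return bytes(child)
    for _ in range(points):
        position = rng.randrange(len(child))
        child[position] = rng.randrange(256)
    return bytes(child)


rng = random.Random(0)
original = bytes(range(32))
mutant = mutate(original, MUTATION_POINTS["main"], rng)
```

A mutation point may draw the byte value it already holds, or two points may coincide, so the number of changed bytes is at most the number of mutation points.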

Step 5.4: the newly obtained superior individuals of the main population and the sub-populations are fed to the program under test for execution, and steps 3 to 5 are repeated.

Test results: the experiment measured the newly discovered edges of three programs under test within a specified time. The results show that in both groups of tests the generated test data achieves higher code coverage in the same amount of time, with code coverage improved by more than 27% compared with AFL. In addition, 9 programs under test containing vulnerabilities were each tested 100 times and the results averaged; across all 9 programs, the efficiency of the invention in finding crash vulnerabilities is 13% higher than that of AFL.

The above detailed description is intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above detailed description is only exemplary of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A binary program fuzz testing method based on a multi-population genetic algorithm, characterized by comprising the following steps:

step 3, first obtaining the basic block sequence and the corresponding edge sequence, then merging identical edges to obtain the edge set $E_e$ containing occurrence-count information, then dividing the number of occurrences of each edge into 8 classes represented by the different bits of one byte, obtaining a new edge set $E'_e$ after classification, and finally obtaining the execution path information of the program, i.e. the edge set $E'_e$;

step 4.1, defining the $i$-th individual in the population as $X_i = (x_{i,1}, x_{i,2}, x_{i,3}, \ldots, x_{i,D})$, the set of all edges discovered during the whole fuzz testing process as $E_t = \{e_{t,1}, e_{t,2}, e_{t,3}, \ldots\}$, the execution path information of the program, i.e. the set of edges, as $E_i = \{e_{i,1}, e_{i,2}, e_{i,3}, \ldots\}$, and the number of elements in a set $A$ as $\mathrm{card}(A)$, and calculating by $f_1(X_i) = \mathrm{card}(E_i - E_t)$ the number of edges newly discovered after the test data is executed in the program under test;

step 4.2, for any edge $e_{t,i}$ of the set $E_t = \{e_{t,1}, e_{t,2}, e_{t,3}, \ldots\}$ of all edges discovered during the whole fuzz testing process, letting the last test data to find this edge be $X_{t,i}$, obtaining a set $W_t = \{(e_{t,1}, X_{t,1}), (e_{t,2}, X_{t,2}), (e_{t,3}, X_{t,3}), \ldots\}$ in which each edge corresponds to its test data, and using the function

$$f_2(X_i) = \sum_{e \in E_t} R(W(e), X_i)$$

to compute $f_2$, the number of edges in the set $W_t$ that are associated with the test data, where $W(e)$ returns the test data corresponding to edge $e$ in the set $W_t$, $R(x, y)$ is a binary function that returns 1 when $x$ and $y$ are the same and 0 otherwise, and $X_i = (x_{i,1}, x_{i,2}, x_{i,3}, \ldots, x_{i,D})$ is the $i$-th individual in the population;

step 4.3, comparing the fitness of two test data by first comparing their $f_1$ values and, if these are equal, updating the sets $E_t$ and $W_t$ and finally calculating and comparing the $f_2$ values of the test data;

step 5, using 2-opt exchange in the crossover process with randomly generated crossover points between 0 and $D$, and setting different crossover rates and mutation rates for the different sub-populations with the main population as the reference, one lower than the main population and the other higher, thereby preventing the algorithm from falling into premature convergence.

2. The multi-population genetic algorithm based binary program fuzz testing method of claim 1 further comprising: step 3, recording an execution path by recording basic block start points, merging identical edges to obtain a set of edges containing occurrence-count information, then dividing the occurrence counts into n classes, representing each of the n classes by one or more bytes, and finally obtaining a set $E'_e$ of edges representing the program execution path information, the elements of which contain the occurrence-count class of each edge.

3. The multi-population genetic algorithm based binary program fuzz testing method of claim 1 further comprising: step 4.2 defines the set $W_t = \{(e_{t,1}, X_{t,1}), (e_{t,2}, X_{t,2}), (e_{t,3}, X_{t,3}), \ldots\}$ to record edges and the test data associated with them, and uses it to obtain the $f_2$ value in the fitness calculation, i.e. the number of edges, among all discovered edges, that are associated with the test data.

4. The multi-population genetic algorithm based binary program fuzz testing method of claim 1, characterized in that: step 4 uses $f_1(X_i) = \mathrm{card}(E_i - E_t)$ to compute the increment in the number of elements of the set $E_t$ of all discovered edges after the current test data is executed, as the value of the individual's fitness index $f_1$.

5. The multi-population genetic algorithm based binary program fuzz testing method of claim 1 further comprising: the fitness calculation of step 4 gives priority to the newly discovered edges $f_1$ and only then considers all edges associated with the test data, because test data that finds a new execution path is more significant.

6. The multi-population genetic algorithm based binary program fuzz testing method of claim 1 further comprising: in the genetic operations of step 5, the crossover rate and mutation rate of sub-population 1 (class1) are lower than those of the main population, and the crossover rate and mutation rate of sub-population 2 (class2) are higher than those of the main population, so that the algorithm is prevented from falling into premature convergence.