CN117873888A

CN117873888A - Deep learning-based method for efficiently generating fuzz test cases for PDF applications

Info

Publication number: CN117873888A
Application number: CN202410025515.0A
Authority: CN
Inventors: 江贺; 刘家豪; 任志磊; 李晓晨; 周志德
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2024-01-08
Filing date: 2024-01-08
Publication date: 2024-04-12

Abstract

The invention belongs to the field of automated software testing and relates to the construction of fuzz test cases, in particular to a deep learning-based method for efficiently generating fuzz test cases for PDF applications. Using deep learning models such as CNN, Seq2Seq, and Transformer, the method generates PDF test cases that are more efficient, of higher quality, and better targeted, through the steps of data screening, object generation, object attachment, and efficient mutation. The invention can be used to find vulnerabilities in applications that take the PDF file format as input, such as the widely used open-source PDF applications Xpdf, MuPDF, and Poppler. The relevant steps can also be adapted to the input file format of the application, so that vulnerabilities can be found in applications that take other highly structured file formats as input.

Description

Deep learning-based method for efficiently generating fuzz test cases for PDF applications

Technical Field

The invention belongs to the field of automated software testing and relates to the construction of fuzz test cases, in particular to a deep learning-based method for efficiently generating fuzz test cases for PDF applications.

Background

Fuzz testing (fuzzing) is a technique widely used in software development and security testing. It feeds a large amount of random, invalid, or abnormal data to a target program to test its stability and reliability, and thereby uncovers hidden software vulnerabilities and potential security risks. Because fuzzing can be automated and can execute a large number of inputs quickly, it is one of the most widely applied and successful techniques in software engineering. According to how test cases are generated, fuzzing can be divided into mutation-based fuzzing and grammar-based fuzzing. Mutation-based fuzzing generates a large number of mutated inputs by modifying known valid inputs in various ways. This approach generally ignores the input specification of the target application and relies on randomness and variation to discover potential vulnerabilities or errors, so when the input specification is very complex or the input format is strict, its code coverage is very limited. Grammar-based fuzzing uses the grammar rules or file format definition of the target program to generate valid inputs that satisfy the grammar, and then mutates and modifies the generated inputs. Grammar-based fuzzing is therefore currently the most effective technique for fuzzing programs that require highly structured input (e.g., PDF applications): it is more constrained and precise when generating inputs, and it can explore potential problems in a more targeted way while reducing the number of invalid test cases. However, grammar-based fuzzing requires a significant amount of manual effort to write down the grammar rules of the input format.

Deep learning-based fuzzing overcomes this obstacle by introducing deep learning into grammar-based fuzzing. Specifically, deep learning is mainly used to learn the grammar automatically from a corpus that conforms to the input grammar rules, so that test cases with a higher pass rate can be generated for fuzzing. In addition, many stages of the fuzzing process can be treated as classification problems that are well suited to deep learning, such as deciding whether seed files and test cases are valid, whether a crash is exploitable, and which mutation operator to select. However, current deep learning-based fuzzing methods for PDF programs generally face challenges such as uneven training data quality, an inability to balance coverage growth against test case size growth, and the lack of an efficient and stable mutation method. As a result, the generated test cases have low code coverage, poor vulnerability-finding capability, and poor targeting, which seriously reduces fuzzing efficiency.

Disclosure of Invention

To solve the above problems, the invention provides a deep learning-based method for efficiently generating fuzz test cases for PDF applications.

The technical scheme of the invention is as follows:

A deep learning-based method for efficiently generating fuzz test cases for PDF applications comprises the following steps:

Step 1: Collect a large number of PDF files from public online data sets and websites, remove duplicate and corrupted files, and construct the data set.

Step 2: Extract a small number of PDF files from the data set and fuzz the program under test with them repeatedly. From the results, collect the test cases that cause the program under test to crash or enter an infinite loop, as well as test cases that do not. Read each test case in binary form and represent it as a vector $x = \langle x_1, x_2, \ldots, x_n \rangle$, where the vector length $n$ is the length of the largest PDF file and $x_i = \mathrm{byte}_i \in \{0, \ldots, 255\}$ is the $i$-th byte of the file; for files shorter than $n$, the remaining positions are filled with $x_i = 256$. Assign label 1 to files that cause the program under test to crash or enter an infinite loop and label 0 to all other files, and train a CNN classification model.

Step 3: Use the trained CNN classification model to screen the remaining data in the data set for files that are likely to cause the program under test to crash or enter an infinite loop, that is, PDF files classified as label 1.

Step 4: Extract the objects from the PDF files selected by the CNN classification model. Replace the binary data in each PDF object with a special sequence, "stream", and store the replaced binary data separately, so that text data and binary data are stored apart. Cut the text data and the binary data into fixed-length segments, convert them into vectors in the manner of Step 2, and train a separate Seq2Seq prediction model on each.

Step 5: Randomly extract an initial sequence from the text data and from the binary data of the PDF objects, convert it into a vector in the manner of Step 2, and feed it to the trained Seq2Seq model. Append the element chosen by a sampling algorithm to the input vector to form a new input vector, feed the new vector to the model again, and repeat until complete text data and binary data have been generated; then combine the text data and the binary data into a complete PDF object. Ideally, the Seq2Seq model should generate text data and binary data that are as diverse as possible. In practice, because the learned probability distribution is fixed once training finishes, the model often produces the same output for a given initial sequence. To overcome this problem, a sampling algorithm is used during Seq2Seq prediction. The sampling algorithm takes as input the input sequence seq, the trained Seq2Seq model, the execution threshold t_fuzz, and the sampling threshold p_t, and it increases the diversity of the generated text data and binary data.

Step 6: Attach the generated PDF objects to complete PDF files. Two strategies are used to balance coverage growth against file size growth in the generated PDF test cases. The AOU (All Objects Update) object attachment method uses a PDF object generated by the Seq2Seq model to replace each object of a complete PDF file in turn, producing one new PDF file per replaced object, which increases coverage while limiting file size. The in-place PDF update method departs from the incremental update of the PDF standard, which appends the new PDF object, cross-reference table, and trailer to the end of the file; instead, it deletes the PDF object to be overwritten, inserts the new PDF object at the original position, and rewrites the cross-reference table, which limits the growth in PDF file size.

Step 7: Train a Transformer model. Randomly extract N PDF files from the data set, generate M variants of each PDF file by random mutation, and for each file select the variant whose coverage improves most over the original file. Convert the N original PDF files and the N highest-coverage variants into vectors in the manner of Step 2, and train the Transformer model with them as input vectors and target vectors respectively, to obtain the efficient mutation model.

Step 8: Feed the PDF files obtained in Step 6 into the Transformer model for efficient mutation to obtain the final test cases used for fuzzing.

The beneficial effects of the invention are as follows. By using deep learning models such as CNN, Seq2Seq, and Transformer, together with the steps of data screening, object generation, object attachment, and efficient mutation, the invention overcomes the challenges of the prior art and generates PDF test cases that are more efficient, of higher quality, and better targeted. The invention can be used to find vulnerabilities in applications that take the PDF file format as input, such as the widely used open-source PDF applications Xpdf, MuPDF, and Poppler. The relevant steps can also be adapted to the input file format of the application, so that vulnerabilities can be found in applications that take other highly structured file formats as input.

Drawings

Fig. 1 is a flowchart of the deep learning-based method for efficiently generating fuzz test cases for PDF applications.

Fig. 2 is a flowchart of the sampling algorithm.

Detailed Description

Embodiments of the invention are further described below with reference to the drawings and the technical scheme.

As shown in Fig. 1, PDF fuzz test case generation proceeds as follows. Apart from the parameters fixed by the generation procedure, other parameters, such as the execution threshold of the sampling algorithm, the sampling threshold, the termination condition of the sampling algorithm, the number of files in the Transformer training set, and the number of variant files, are set according to the specific requirements of the program under test.

Step 2: Extract a small number of PDF files from the data set and fuzz the program under test with them repeatedly. From the results, collect the test cases that cause the program under test to crash or enter an infinite loop, as well as test cases that do not. Read each test case in binary form and represent it as a vector $x = \langle x_1, x_2, \ldots, x_n \rangle$, where the vector length $n$ is the length of the largest PDF file and $x_i = \mathrm{byte}_i \in \{0, \ldots, 255\}$ is the $i$-th byte of the file; for files shorter than $n$, the remaining positions are filled with $x_i = 256$. Assign label 1 to files that cause the program under test to crash or enter an infinite loop and label 0 to all other files, and train a CNN classification model.

Because many programs under test fix reported bugs promptly, fuzzing the latest version in Step 2 may fail to yield any test case that causes a crash or an infinite loop. In that case, Step 2 is supplemented either by fuzzing an earlier version that still contains the bugs, or by manually downloading, from the program's bug-report forum, reported test cases that trigger crashes. According to the 80/20 rule (Pareto principle) in software testing, defective code tends to be clustered; that is, other defects are likely to exist near known defective code. Therefore, even if the vulnerabilities corresponding to the test cases obtained in Step 2 have already been found or fixed, the model can still learn the features of these files to generate new PDF objects and construct new PDF test cases. Through the subsequent sampling and mutation steps, the newly generated test cases can cover code near the original defective code, and thereby reveal other defects hidden nearby or new defects introduced by the bug fix.
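
The byte-level encoding described in Step 2 can be illustrated with a minimal Python sketch. The function names `encode` and `encode_corpus` are hypothetical and chosen only for illustration; the sketch covers the vectorization and padding only, not the CNN itself.

```python
PAD = 256  # padding symbol, deliberately outside the byte range 0..255


def encode(data: bytes, n: int) -> list[int]:
    """Map each byte to an integer and right-pad with PAD up to length n."""
    if len(data) > n:
        raise ValueError("n must be at least the length of the largest file")
    return list(data) + [PAD] * (n - len(data))


def encode_corpus(files: list[bytes]) -> list[list[int]]:
    """Encode every file to the length of the largest file in the corpus."""
    n = max(len(f) for f in files)
    return [encode(f, n) for f in files]
```

Because the padding value 256 never occurs as a real byte, a classifier can distinguish padding from file content; an embedding layer over such vectors would need a vocabulary of 257 symbols.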

Step 3: Use the trained CNN classification model to screen the remaining bulk of the data set for files that are more likely to cause the program under test to crash or enter an infinite loop, that is, PDF files classified as label 1.

Step 4: Extract the objects from the PDF files selected by the CNN model. PDF objects fall into two types according to their data: plain-text objects and objects that contain binary data. Because text data and binary data have different characteristics, they must be processed separately. Specifically, the method replaces the binary data in each PDF object with a special text sequence, "stream", and stores the replaced binary data separately, so that text data and binary data are stored apart. The text data and the binary data are then cut into fixed-length segments, converted into vectors in the manner of Step 2, and used to train a separate Seq2Seq prediction model each.
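
The separation of text and binary data in Step 4 might look roughly like the following Python sketch. It is an illustration under assumptions, not the patented implementation: the `PLACEHOLDER` token, the regular expression, and the helper names are hypothetical, and the sketch assumes the binary payload sits between the `stream` and `endstream` keywords of a PDF object.

```python
import re

# Hypothetical placeholder token standing in for the removed binary payload.
PLACEHOLDER = b"<<BINARY>>"
# Payload between the "stream" and "endstream" keywords of a PDF object.
STREAM_RE = re.compile(rb"(stream\r?\n)(.*?)(\r?\nendstream)", re.DOTALL)


def split_object(obj: bytes) -> tuple[bytes, list[bytes]]:
    """Return the object text with payloads replaced, plus the payloads."""
    payloads: list[bytes] = []

    def _swap(match: re.Match) -> bytes:
        payloads.append(match.group(2))
        return match.group(1) + PLACEHOLDER + match.group(3)

    return STREAM_RE.sub(_swap, obj), payloads


def merge_object(text: bytes, payloads: list[bytes]) -> bytes:
    """Put each stored payload back in place of the next placeholder."""
    for payload in payloads:
        text = text.replace(PLACEHOLDER, payload, 1)
    return text


def chunk(data: bytes, size: int) -> list[bytes]:
    """Cut data into segments of `size` bytes (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]
```

The text part and the stored payloads can then be chunked and vectorized independently, which matches the idea of training one model per data type.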

Step 5: Randomly extract initial sequences of random length from the text data and from the binary data of the PDF objects, convert them into vectors in the manner of Step 2, and feed them to the trained Seq2Seq model. Append the element chosen by a sampling algorithm to the input vector to form a new input vector, feed the new vector to the model again, and repeat until complete text data and binary data have been generated. The generated binary data then replaces the "stream" sequence in the text data to produce a complete PDF object. Ideally, the Seq2Seq model should generate text data and binary data that are as diverse as possible. In practice, because the learned probability distribution is fixed once training finishes, the model often produces the same output for a given initial sequence. To overcome this problem, a sampling algorithm is used during Seq2Seq prediction. The sampling algorithm takes as input the input sequence seq, the trained Seq2Seq model, the execution threshold t_fuzz, and the sampling threshold p_t, and it increases the diversity of the generated text data and binary data. In each prediction round, the Seq2Seq model is first run to obtain the most probable next element c of the current sequence and its probability p(c). A number p_fuzz between 0 and 1 is then drawn at random. If p_fuzz > t_fuzz and p(c) satisfies the threshold condition on p_t, an integer s between 1 and 4 is drawn at random, and depending on the value of s, the least probable element, a random element, a repeated element, or no element at all is appended to the end of the current sequence to form the new input sequence. The new input sequence is then fed to the model, and the process repeats until the termination condition is reached. The above process is shown in Fig. 2.
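
The sampling procedure of Step 5 might be sketched in Python roughly as follows. This is an illustrative sketch rather than the patented implementation: the `model` callable (returning one probability per vocabulary element), the `max_len` and `max_steps` termination limits, and the reading of "repeated element" as repeating the last element of the current sequence are all assumptions. The comparison between p(c) and p_t is not legible in the text above; the sketch assumes p(c) > p_t.

```python
import random
from typing import Callable, Sequence

# Hypothetical model interface: current sequence -> probability per element id.
Model = Callable[[Sequence[int]], Sequence[float]]


def sample_next(seq: list[int], model: Model, t_fuzz: float, p_t: float,
                rng: random.Random) -> list[int]:
    """Run one prediction round and return the extended sequence."""
    probs = model(seq)
    ids = range(len(probs))
    c = max(ids, key=probs.__getitem__)      # most probable next element
    p_fuzz = rng.random()
    # Assumption: the threshold condition is p(c) > p_t.
    if p_fuzz > t_fuzz and probs[c] > p_t:
        s = rng.randint(1, 4)
        if s == 1:                           # least probable element
            nxt = [min(ids, key=probs.__getitem__)]
        elif s == 2:                         # random element
            nxt = [rng.randrange(len(probs))]
        elif s == 3:                         # repeated element (assumed: last one)
            nxt = [seq[-1]] if seq else [c]
        else:                                # null: append nothing
            nxt = []
    else:
        nxt = [c]
    return seq + nxt


def generate(seed: list[int], model: Model, t_fuzz: float, p_t: float,
             max_len: int, rng: random.Random,
             max_steps: int = 10_000) -> list[int]:
    """Extend the seed until max_len elements or max_steps rounds (assumed limits)."""
    seq = list(seed)
    for _ in range(max_steps):
        if len(seq) >= max_len:
            break
        seq = sample_next(seq, model, t_fuzz, p_t, rng)
    return seq
```

With t_fuzz set to 1.0 the fuzzing branch is never taken and generation is purely greedy; lowering t_fuzz makes deviations from the most probable element more frequent.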

Step 6: Attach the generated PDF objects to complete PDF files. This step uses two strategies to balance coverage growth against file size growth in the generated PDF test cases. The AOU (All Objects Update) object attachment method uses a PDF object generated by the Seq2Seq model to replace each object of a complete PDF file in turn, producing one new PDF file per replaced object. For example, if the complete PDF file has 3 objects, PDF = [obj1, obj2, obj3], three new PDF files are generated: PDF1 = [new_obj, obj2, obj3], PDF2 = [obj1, new_obj, obj3], and PDF3 = [obj1, obj2, new_obj]. This ensures that every object position in the PDF file is covered, which improves coverage growth, and it avoids the excessive file size growth that would result from adding many objects to a single file. The in-place PDF update method departs from the incremental update of the PDF standard, which appends the new PDF object, cross-reference table, and trailer to the end of the file; instead, it deletes the PDF object to be overwritten, inserts the new PDF object at the original position, and rewrites the cross-reference table, which limits the growth in PDF file size.
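
The AOU strategy reduces to a simple list operation, sketched below in Python. The function name `all_objects_update` is hypothetical, objects are modeled as opaque byte strings, and serializing each resulting object list back into a PDF file (including rewriting the cross-reference table) is left out of the sketch.

```python
def all_objects_update(objects: list[bytes], new_obj: bytes) -> list[list[bytes]]:
    """Return one object list per position, with that position replaced by new_obj."""
    return [
        objects[:i] + [new_obj] + objects[i + 1:]
        for i in range(len(objects))
    ]
```

For a file with k objects this yields k variants, each with the same number of objects as the original, so the file size changes only by the size difference between the replaced object and the new one.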

Step 7: Train the Transformer efficient mutation model. Randomly extract N PDF files from the data set, generate M variants of each PDF file by random mutation, and for each file select the variant whose coverage improves most over the original file. Convert the N original PDF files and the N variants with the largest coverage improvement into vectors in the manner of Step 2, and train the Transformer model with them as input vectors and target vectors respectively. Taking MuPDF as an example, the --coverage option can be added when compiling MuPDF with the GCC compiler so that the code is instrumented. After MuPDF is run on a variant file, the GCOV tool can analyze the run automatically and report statement coverage, and comparing the statement coverage identifies the variant with the largest coverage improvement. GCC and GCOV are free, open-source software; in practice, users may choose whichever tool suits the program under test and the application scenario to obtain coverage information.
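
The construction of the Transformer training pairs in Step 7 could be sketched as follows. This is a minimal illustration under assumptions: the `coverage` callable is a hypothetical stand-in for whatever tool reports coverage (for example, a wrapper around a GCOV run), and the random byte-overwrite mutation in `mutate` is only one possible form of random mutation.

```python
import random
from typing import Callable


def mutate(data: bytes, rng: random.Random, n_flips: int = 4) -> bytes:
    """Overwrite a few randomly chosen bytes with random values (assumed mutation)."""
    buf = bytearray(data)
    for _ in range(n_flips):
        if not buf:
            break
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def best_variant(original: bytes, m: int, coverage: Callable[[bytes], float],
                 rng: random.Random) -> bytes:
    """Generate m variants and keep the one with the highest coverage."""
    variants = [mutate(original, rng) for _ in range(m)]
    # The original's coverage is constant, so maximizing the variant's
    # coverage also maximizes the improvement over the original.
    return max(variants, key=coverage)


def build_pairs(originals: list[bytes], m: int,
                coverage: Callable[[bytes], float],
                rng: random.Random) -> list[tuple[bytes, bytes]]:
    """Return (input, target) pairs: each original with its best variant."""
    return [(o, best_variant(o, m, coverage, rng)) for o in originals]
```

Each pair can then be vectorized in the manner of Step 2, with the original as the input vector and the selected variant as the target vector.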

Claims

1. A deep learning-based method for efficiently generating fuzz test cases for PDF applications, characterized by comprising the following steps:

step 1: collecting a large number of PDF files from public online data sets and websites, removing duplicate and corrupted files, and constructing a data set;

step 2: extracting a small number of PDF files from the data set and fuzzing the program under test with them repeatedly, and collecting from the results the test cases that cause the program under test to crash or enter an infinite loop as well as test cases that do not; reading each test case in binary form and representing it as a vector $x = \langle x_1, x_2, \ldots, x_n \rangle$, where the vector length $n$ is the length of the largest PDF file and $x_i = \mathrm{byte}_i \in \{0, \ldots, 255\}$ is the $i$-th byte of the file, and, for files shorter than $n$, filling the remaining positions with $x_i = 256$; assigning label 1 to files that cause the program under test to crash or enter an infinite loop and label 0 to all other files, and training a CNN classification model;

step 3: using the trained CNN classification model to screen the remaining data in the data set for files that are likely to cause the program under test to crash or enter an infinite loop, that is, PDF files classified as label 1;

step 4: extracting the objects from the PDF files selected by the CNN binary classification model, replacing the binary data in each PDF object with a special sequence, "stream", and storing the replaced binary data separately, so that text data and binary data are stored apart; cutting the text data and the binary data into fixed-length segments, converting them into vectors in the manner of step 2, and training a separate Seq2Seq prediction model on each;

step 5: randomly extracting an initial sequence from the text data and from the binary data of the PDF objects, converting it into a vector in the manner of step 2, feeding it to the trained Seq2Seq model, appending the element chosen by a sampling algorithm to the input vector to form a new input vector, feeding the new input vector to the model again, repeating the process until complete text data and binary data have been generated, and then combining the text data and the binary data into a complete PDF object; wherein the sampling algorithm is used during Seq2Seq prediction, takes as input the input sequence seq, the trained Seq2Seq model, the execution threshold t_fuzz, and the sampling threshold p_t, and increases the diversity of the generated text data and binary data;

step 6: attaching the generated PDF objects to complete PDF files, using two strategies to balance coverage growth against file size growth in the generated PDF test cases; wherein the AOU object attachment method uses a PDF object generated by the Seq2Seq model to replace each object of a complete PDF file in turn, producing one new PDF file per replaced object, which increases coverage while limiting file size; and the in-place PDF update method departs from the incremental update of the PDF standard, which appends the new PDF object, cross-reference table, and trailer to the end of the file, and instead deletes the PDF object to be overwritten, inserts the new PDF object at the original position, and rewrites the cross-reference table, so as to limit the growth in PDF file size;

step 7: training a Transformer model; randomly extracting N PDF files from the data set, generating M variants of each PDF file by random mutation, and for each file selecting the variant whose coverage improves most over the original file; converting the N original PDF files and the N highest-coverage variants into vectors in the manner of step 2, and training the Transformer model with them as input vectors and target vectors respectively, to obtain an efficient mutation model;