CN117873888A - Deep learning-based efficient PDF application program fuzzy test case generation method - Google Patents


Info

Publication number
CN117873888A
Authority
CN
China
Prior art keywords
pdf
file
model
data
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410025515.0A
Other languages
Chinese (zh)
Inventor
江贺
刘家豪
任志磊
李晓晨
周志德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN202410025515.0A
Publication of CN117873888A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/36: Preventing errors by testing or debugging software
    • G06F11/3668: Software testing
    • G06F11/3672: Test management
    • G06F11/3684: Test management for test design, e.g. generating new test cases
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention belongs to the field of automated software testing, relates to techniques for constructing fuzz test cases, and in particular relates to a deep learning-based method for efficiently generating fuzz test cases for PDF applications. The invention uses deep learning models such as CNN, Seq2Seq, and Transformer, and generates PDF test cases more efficiently, with higher quality and better targeting, through the steps of data screening, object generation, object attachment, and efficient mutation. The invention can be used to search for vulnerabilities in applications that take the PDF file format as input, such as the widely used open-source PDF applications XPDF, MUPDF, and POPPLER. The relevant steps can also be adapted to the input file format of other applications, so that vulnerabilities can be searched for in applications that take other highly structured file formats as input.

Description

Deep learning-based efficient PDF application program fuzzy test case generation method
Technical Field
The invention belongs to the field of automated software testing, relates to techniques for constructing fuzz test cases, and in particular relates to a deep learning-based method for efficiently generating fuzz test cases for PDF applications.
Background
Fuzz testing (fuzzing) is a technique widely used in software development and security testing. It feeds a large amount of random, invalid, or abnormal data into a target program to test its stability and reliability, thereby uncovering hidden software vulnerabilities and potential security risks. Because fuzzing can be automated and can therefore execute a large number of inputs quickly, it is one of the most widely applied and successful techniques in software engineering. According to how test cases are generated, fuzzing can be divided into mutation-based fuzzing and grammar-based fuzzing. Mutation-based fuzzing generates a large number of mutated inputs by modifying known valid inputs in various ways. This approach generally does not consider the input specification of the target application and instead relies on randomness and variability to discover potential vulnerabilities or errors, so when the input specification is very complex or the input format is strict, the code coverage it achieves is very limited. Grammar-based fuzzing is driven by input grammar rules: it uses the grammar rules or file format definition of the target program to generate valid inputs that satisfy the grammar, and then mutates and modifies the generated inputs. Grammar-based fuzzing is therefore currently the most effective technique for fuzzing programs that require highly structured inputs (e.g., PDF applications): it is more constrained and precise when generating inputs, and it can explore potential problems in a more targeted way while reducing the number of invalid test cases. However, grammar-based fuzzing faces the obstacle that summarizing the grammar rules of the input format requires a significant amount of manual labor.
Deep learning-based fuzzing overcomes this obstacle by introducing deep learning into grammar-based fuzzing. Specifically, deep learning is mainly used to learn the grammar automatically from a corpus that conforms to the input grammar rules, so that test cases with a higher pass rate can be generated for fuzzing. In addition, many stages of the fuzzing process can be treated as classification problems that are well suited to deep learning, such as deciding whether seed files and test cases are valid, whether a crash is exploitable, and which mutation operators to select. However, current deep learning-based fuzzing methods for PDF programs generally face challenges such as uneven quality of training data, an inability to balance coverage growth against test case size growth, and the lack of an efficient and stable mutation method. As a result, the generated test cases have low code coverage, weak vulnerability-finding ability, and poor targeting, which seriously reduces fuzzing efficiency.
Disclosure of Invention
To solve these problems, the invention provides a deep learning-based method for efficiently generating fuzz test cases for PDF applications.
The technical scheme of the invention is as follows:
A deep learning-based method for efficiently generating fuzz test cases for PDF applications comprises the following steps:
Step 1: Collect a large number of PDF files from public data sets and websites, remove duplicate and damaged files, and construct the data set.
Step 2: Extract a small number of PDF files from the data set and repeatedly fuzz the program under test with them. From the test results, collect both the test cases that cause the program under test to crash or enter an infinite loop and the test cases that do not. Each test case is read in binary form and expressed as a vector $x = \langle x_1, x_2, \ldots, x_n \rangle$, where the vector length $n$ is the length of the largest PDF file and, for $i = 1, \ldots, n$, $x_i = \mathrm{byte}_i$ with $x_i \in \{0, \ldots, 255\}$; positions beyond the end of a file shorter than $n$ are padded with $x_i = 256$. Files that cause the program under test to crash or enter an infinite loop are given label 1, all other files are given label 0, and a CNN classification model is trained on the labeled vectors.
Step 3: Use the trained CNN classification model to screen the remaining data in the data set for data that is likely to cause the program under test to crash or enter an infinite loop, that is, the PDF files classified as label 1.
Step 4: Extract the objects from the PDF files selected by the CNN classification model, replace the binary data in each PDF object with a special stream sequence, and store the replaced binary data separately, so that text data and binary data are stored apart. Cut the text data and the binary data into fixed-length segments, convert them into vectors as in step 2, and train a separate Seq2Seq prediction model on each.
Step 5: Randomly extract initial sequences of PDF object text data and binary data, convert them into vectors as in step 2, and input them into the trained Seq2Seq models. The element selected by a sampling algorithm is spliced onto the input vector to form a new input vector, which is input into the model again; this process is repeated until complete text data and binary data have been generated, and the text data and binary data are then combined into a complete PDF object. Ideally, the Seq2Seq model should generate text data and binary data that are as diverse as possible. In practice, however, the probability relationships are fixed once training is complete, so the model often produces the same result for a given initial sequence. To overcome this problem, a sampling algorithm is used during Seq2Seq prediction. It takes as input the input sequence seq, the trained Seq2Seq model, the execution threshold t_fuzz, and the sampling threshold p_t, and it increases the diversity of the generated text data and binary data.
Step 6: Attach the generated PDF objects to complete PDF files. Two strategies are used to balance coverage growth against file size growth in the generated PDF test cases. The AOU (All Objects Update) object attachment method uses a PDF object generated by the Seq2Seq model to replace each object in a complete PDF file in turn, generating one new PDF file per replacement, which improves coverage growth while limiting file size growth. The in-place PDF update method changes the standard PDF incremental update, which appends the new PDF object, cross-reference table, and trailer to the end of the PDF file: instead, it directly deletes the PDF object to be overwritten, inserts the updated PDF object at its original position, and then rewrites the cross-reference table, thereby reducing the growth in PDF file size.
Step 7: Train a Transformer model. Randomly extract N PDF files from the data set, generate M variants of each PDF file by random mutation, and for each file select the variant whose coverage improves most over the original file. Convert the N original PDF files and the N highest-coverage variants into vectors as in step 2, and train the Transformer model with them as input vectors and target vectors, respectively, to obtain the efficient mutation model.
Step 8: Input the PDF files obtained in step 6 into the Transformer model for efficient mutation to obtain the final test cases used to run the fuzz test.
The beneficial effects of the invention are as follows. By using deep learning models such as CNN, Seq2Seq, and Transformer, together with the steps of data screening, object generation, object attachment, and efficient mutation, the invention overcomes the challenges in the prior art and generates PDF test cases more efficiently, with higher quality and better targeting. The invention can be used to search for vulnerabilities in applications that take the PDF file format as input, such as the widely used open-source PDF applications XPDF, MUPDF, and POPPLER. The relevant steps can also be adapted to the input file format of other applications, so that vulnerabilities can be searched for in applications that take other highly structured file formats as input.
Drawings
Fig. 1 is a flowchart of the deep learning-based method for efficiently generating fuzz test cases for PDF applications.
Fig. 2 is a flow chart of a sampling algorithm.
Detailed Description
Embodiments of the invention are further described below with reference to the drawings and the technical scheme.
As shown in Fig. 1, PDF fuzz test case generation proceeds as follows. Apart from the settings fixed by the generation process itself, other parameters, such as the execution threshold of the sampling algorithm, the sampling threshold, the termination condition of the sampling algorithm, the number of files in the Transformer training set, and the number of variant files, are set according to the specific requirements of the program under test.
Step 1: Collect a large number of PDF files from public data sets and websites, remove duplicate and damaged files, and construct the data set.
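The data set construction in step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_dataset` is hypothetical, duplicates are detected by SHA-256 content hash, and the "damaged file" test (a `%PDF-` header plus an `%%EOF` marker near the end) is an assumed heuristic, since the patent does not say how damage is detected.

```python
import hashlib


def build_dataset(files):
    """Filter (name, data) pairs, dropping damaged and duplicate PDFs.

    `files` is an iterable of (name, bytes). Returns the names that are kept.
    """
    seen = set()
    kept = []
    for name, data in files:
        # Assumed damage heuristic: require the PDF header and an EOF marker
        # within the last 1024 bytes. A real pipeline might run a parser instead.
        if not data.startswith(b"%PDF-") or b"%%EOF" not in data[-1024:]:
            continue
        # Exact-duplicate detection by content hash.
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(name)
    return kept
```

Hashing the content rather than comparing file names means that the same PDF downloaded from two sources is kept only once.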
Step 2: Extract a small number of PDF files from the data set and repeatedly fuzz the program under test with them. From the test results, collect both the test cases that cause the program under test to crash or enter an infinite loop and the test cases that do not. Each test case is read in binary form and expressed as a vector $x = \langle x_1, x_2, \ldots, x_n \rangle$, where the vector length $n$ is the length of the largest PDF file and, for $i = 1, \ldots, n$, $x_i = \mathrm{byte}_i$ with $x_i \in \{0, \ldots, 255\}$; positions beyond the end of a file shorter than $n$ are padded with $x_i = 256$. Files that cause the program under test to crash or enter an infinite loop are given label 1, all other files are given label 0, and a CNN classification model is trained on the labeled vectors.
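The vector encoding described in step 2 can be illustrated with a short sketch. The function names are hypothetical; the padding value 256 and the 0/1 labels follow the text, and the CNN itself is omitted.

```python
PAD = 256  # padding symbol; lies outside the byte range 0..255


def to_vector(data, n):
    """Encode a file as a length-n vector of byte values, padded with 256."""
    if len(data) > n:
        raise ValueError("file is longer than the vector length n")
    return list(data) + [PAD] * (n - len(data))


def vectorize_corpus(samples):
    """samples: list of (file_bytes, crashed_or_hung) pairs.

    Returns (vectors, labels), with n set to the length of the largest file.
    """
    n = max(len(data) for data, _ in samples)
    vectors = [to_vector(data, n) for data, _ in samples]
    labels = [1 if bad else 0 for _, bad in samples]
    return vectors, labels
```

Because 256 can never occur as a real byte value, the classifier can distinguish padding from file content.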
Because many programs under test fix reported bugs promptly, step 2 may fail to find any test case that causes the latest version of the program under test to crash or enter an infinite loop. In this case, the invention supplements step 2 either by fuzzing an earlier version that still contains the errors, or by manually downloading reported crash-triggering test cases from the program's bug reporting forum. According to the 80/20 rule (Pareto principle) observed in software testing, faulty code tends to be clustered, that is, other errors are likely to exist near a known error. Therefore, even if the vulnerability corresponding to a test case obtained in step 2 has already been found or fixed, the model can still learn the features of these files to generate new PDF objects and construct new PDF test cases. Through the subsequent sampling and mutation steps, the newly generated PDF test cases can cover other code near the original faulty code, and can thus find other errors hidden near it or new errors introduced by the bug fix.
Step 3: Use the trained CNN classification model to screen the remaining majority of the data set for data that is more likely to cause the program under test to crash or enter an infinite loop, that is, the PDF files classified as label 1.
Step 4: Extract the objects from the PDF files selected by the CNN model. PDF objects fall into two types according to their data: plain text objects and objects containing binary data. Because text data and binary data have different characteristics, they must be processed separately. Specifically, the method replaces the binary data in each PDF object with a special text stream sequence and stores the replaced binary data separately, so that text data and binary data are stored apart. The text data and the binary data are then cut into fixed-length segments, converted into vectors as in step 2, and used to train separate Seq2Seq prediction models.
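The text/binary separation in step 4 might look like the following sketch. Several details are assumptions rather than the patent's method: the placeholder token `@@BINARY@@` stands in for the "special text stream sequence", stream payloads are located with a regular expression on the `stream` / `endstream` keywords (a robust parser would use the `/Length` entry instead), and all function names are hypothetical.

```python
import re

PLACEHOLDER = b"@@BINARY@@"  # hypothetical marker standing in for the binary payload
STREAM_RE = re.compile(rb"(stream\r?\n)(.*?)(\r?\nendstream)", re.DOTALL)


def split_object(obj):
    """Return (text_part, payloads): the object with each stream payload
    replaced by PLACEHOLDER, plus the list of extracted binary payloads."""
    payloads = []

    def swap(match):
        payloads.append(match.group(2))
        return match.group(1) + PLACEHOLDER + match.group(3)

    return STREAM_RE.sub(swap, obj), payloads


def join_object(text, payloads):
    """Inverse of split_object: put the binary payloads back in order."""
    remaining = iter(payloads)
    return re.sub(re.escape(PLACEHOLDER), lambda m: next(remaining), text)


def chunks(data, size):
    """Cut data into fixed-length segments (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]
```

`join_object` corresponds to the later recombination in step 5, where generated binary data replaces the placeholder in generated text data.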
Step 5: randomly extracting random length initial sequence sequences of PDF object text data and binary data, converting the random length initial sequence sequences into vectors in the step 2, inputting a trained Seq2Seq model, selecting elements obtained by a sampling algorithm, splicing the elements into the input vectors to form new input vectors, inputting the new input vectors into the model again, and repeating the process until complete text data and binary data are generated. And then replacing the stream sequence in the text data by using the binary data to generate a complete PDF object. Ideally, the Seq2Seq model should generate as diverse text data and binary data as possible, but in practice, the model is generated with often the same result, since the probability relationship has been determined after model training is completed, resulting in an initial sequence being entered. The method for overcoming the problem is that a sampling algorithm is used in the prediction generation process of the Seq2Seq model, the input sequence Seq and the trained Seq2Seq model are executed, a threshold t_fuzz is executed, a sampling threshold p_t is used as the input of the sampling algorithm, and the diversity of text data and binary data is improved by the sampling algorithm. In each prediction process, firstly, a Seq2Seq model is operated to obtain the next element c with the highest probability of the current sequence and the probability p (c) thereof, then, a number p_fuzz in 0-1 is randomly obtained, if p_fuzz > t_fuzz and p (c) p_t, an integer s in 1-4 is randomly obtained, and the element with the smallest probability, the random element, the repeated element and the null are respectively used as the next element to splice to the end of the current sequence to form a new input sequence according to the difference of the values of s, then, the new input sequence is input into the model, and the process is repeated until the end condition is reached. 
The above process is shown in fig. 2.
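The sampling step of Fig. 2 can be sketched as below, under several stated assumptions. The comparison between p(c) and p_t is taken to be p(c) > p_t, which is not confirmed by the text; when the condition fails, the most probable element c is used; "repeated element" is read as repeating the previous element; and "null" is read as appending nothing. The `model` argument is a hypothetical callable that maps a sequence to a dictionary of next-element probabilities, and a step limit plus an end token serve as the termination condition.

```python
import random


def sample_next(probs, prev, t_fuzz, p_t, rng):
    """Pick the next element; returns None for the 'null' choice."""
    c = max(probs, key=probs.get)          # most probable next element
    p_fuzz = rng.random()                  # number in [0, 1)
    if p_fuzz > t_fuzz and probs[c] > p_t:  # assumed direction of the p_t test
        s = rng.randint(1, 4)
        if s == 1:
            return min(probs, key=probs.get)   # lowest-probability element
        if s == 2:
            return rng.choice(sorted(probs))   # random element
        if s == 3:
            return prev if prev is not None else c  # repeated element (assumed)
        return None                            # null: append nothing
    return c


def generate(model, seed, t_fuzz, p_t, max_steps, end_token, rng):
    """Extend `seed` one element at a time until end_token or max_steps."""
    seq = list(seed)
    for _ in range(max_steps):
        prev = seq[-1] if seq else None
        nxt = sample_next(model(seq), prev, t_fuzz, p_t, rng)
        if nxt is None:
            continue
        seq.append(nxt)
        if nxt == end_token:
            break
    return seq
```

Raising t_fuzz makes the perturbation branch rarer, so the output stays closer to what the model predicts; lowering it trades validity for diversity.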
Step 6: Attach the generated PDF objects to complete PDF files. This step uses two strategies to balance coverage growth against file size growth in the generated PDF test cases. The AOU (All Objects Update) object attachment method uses a PDF object generated by the Seq2Seq model to replace each object in a complete PDF file in turn, generating one new PDF file per replacement. For example, if the complete PDF file has three objects, PDF = [obj1, obj2, obj3], three new PDF files are generated: PDF1 = [new_obj, obj2, obj3], PDF2 = [obj1, new_obj, obj3], and PDF3 = [obj1, obj2, new_obj]. This ensures that every object in the PDF file is replaced once, which improves coverage growth, and it avoids the excessive file size growth caused by adding several objects to a single file. The in-place PDF update method changes the standard PDF incremental update, in which the new PDF object, cross-reference table, and trailer are appended to the end of the PDF file: instead, it directly deletes the PDF object to be overwritten, inserts the updated PDF object at its original position, and then rewrites the cross-reference table, thereby reducing the growth in PDF file size.
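Both strategies of step 6 can be illustrated together in a small sketch. It assumes a simplified file model that is not taken from the patent: a PDF is a list of object bodies numbered 1..k with generation 0, object 1 is assumed to be the catalog, and the whole file is re-serialized with a freshly computed cross-reference table instead of appending an incremental update. The function names are hypothetical.

```python
def aou_variants(objects, new_obj):
    """AOU: replace each object in turn, one new object list per position."""
    return [objects[:i] + [new_obj] + objects[i + 1:] for i in range(len(objects))]


def serialize(objects):
    """Write object bodies as a minimal PDF with a recomputed xref table.

    Replacing an object and re-serializing keeps it at its original position,
    so the file grows only by the size difference of that object.
    """
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))                      # byte offset of this object
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"                   # entry for free object 0
    for off in offsets:
        out += b"%010d 00000 n \n" % off              # 20-byte in-use entries
    # Assumption: object 1 is the document catalog.
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

A typical use would be `[serialize(v) for v in aou_variants(objects, new_obj)]`, which yields one test file per replaced object.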
Step 7: Train the Transformer efficient mutation model. Randomly extract N PDF files from the data set, generate M variants of each PDF file by random mutation, and for each file select the variant whose coverage improves most over the original file. Convert the N original PDF files and the N highest-coverage variants into vectors as in step 2, and train the Transformer model with them as input vectors and target vectors, respectively. Taking MUPDF as an example, the --coverage option can be added when compiling MUPDF with the GCC compiler so that the code is instrumented. MUPDF is then run on each variant file, the GCOV tool is used to analyze the run automatically and obtain statement coverage information, and the variant with the largest coverage improvement is found by comparing this information. GCC and GCOV are free, open-source software; in actual use, the user can choose whichever tool suits the specific program under test and application scenario to collect coverage information.
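Building the (original, best variant) training pairs of step 7 can be sketched as follows. The `coverage` argument is a hypothetical callable that returns a coverage score for a file, for example a wrapper around a GCOV run; the random byte-overwrite mutator is also only an assumed stand-in for "random mutation", and the Transformer training itself is omitted.

```python
import random


def random_mutate(data, rng, n_flips=4):
    """Assumed mutator: overwrite a few randomly chosen bytes."""
    buf = bytearray(data)
    if not buf:
        return bytes(buf)
    for _ in range(n_flips):
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def build_pairs(files, coverage, m, rng):
    """For each original file, generate m variants and keep the one whose
    coverage improves most over the original. Returns (original, best) pairs."""
    pairs = []
    for original in files:
        base = coverage(original)
        variants = [random_mutate(original, rng) for _ in range(m)]
        best = max(variants, key=lambda v: coverage(v) - base)
        pairs.append((original, best))
    return pairs
```

Each pair then supplies one input vector (the original) and one target vector (the best variant) after the step 2 encoding.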
Step 8: Input the PDF files obtained in step 6 into the Transformer model for efficient mutation to obtain the final test cases used to run the fuzz test.

Claims (1)

1. A deep learning-based method for efficiently generating fuzz test cases for PDF applications, characterized by comprising the following steps:
step 1: collecting a large number of PDF files from public data sets and websites, removing duplicate and damaged files, and constructing a data set;
step 2: extracting a small number of PDF files from the data set and repeatedly fuzzing the program under test with them, and collecting from the test results both the test cases that cause the program under test to crash or enter an infinite loop and the test cases that do not; reading each test case in binary form and expressing it as a vector $x = \langle x_1, x_2, \ldots, x_n \rangle$, where the vector length $n$ is the length of the largest PDF file and, for $i = 1, \ldots, n$, $x_i = \mathrm{byte}_i$ with $x_i \in \{0, \ldots, 255\}$, positions beyond the end of a file shorter than $n$ being padded with $x_i = 256$; giving label 1 to files that cause the program under test to crash or enter an infinite loop and label 0 to all other files, and training a CNN classification model;
step 3: using the trained CNN classification model to screen the remaining data in the data set for data that is likely to cause the program under test to crash or enter an infinite loop, that is, the PDF files classified as label 1;
step 4: extracting the objects from the PDF files selected by the CNN binary classification model, replacing the binary data in each PDF object with a special stream sequence, and storing the replaced binary data separately, so that text data and binary data are stored apart; cutting the text data and the binary data into fixed-length segments, converting them into vectors as in step 2, and training a separate Seq2Seq prediction model on each;
step 5: randomly extracting initial sequences of PDF object text data and binary data, converting them into vectors as in step 2, inputting them into the trained Seq2Seq models, splicing the element selected by a sampling algorithm onto the input vector to form a new input vector, inputting the new input vector into the model again, repeating this process until complete text data and binary data have been generated, and then combining the text data and the binary data into a complete PDF object; wherein a sampling algorithm is used during Seq2Seq prediction, the sampling algorithm takes as input the input sequence seq, the trained Seq2Seq model, the execution threshold t_fuzz, and the sampling threshold p_t, and the sampling algorithm increases the diversity of the text data and binary data;
step 6: attaching the generated PDF objects to complete PDF files, using two strategies to balance coverage growth against file size growth in the generated PDF test cases; the AOU object attachment method uses a PDF object generated by the Seq2Seq model to replace each object in a complete PDF file in turn, generating one new PDF file per replacement, which improves coverage growth while limiting file size growth; the in-place PDF update method changes the standard PDF incremental update, which appends the new PDF object, cross-reference table, and trailer to the end of the PDF file, by directly deleting the PDF object to be overwritten, inserting the updated PDF object at its original position, and then rewrites the cross-reference table, so as to reduce the growth in PDF file size;
step 7: training a Transformer model; randomly extracting N PDF files from the data set, generating M variants of each PDF file by random mutation, and for each file selecting the variant whose coverage improves most over the original file; converting the N original PDF files and the N highest-coverage variants into vectors as in step 2, and training the Transformer model with them as input vectors and target vectors, respectively, to obtain an efficient mutation model;
step 8: inputting the PDF files obtained in step 6 into the Transformer model for efficient mutation to obtain the final test cases used to run the fuzz test.
CN202410025515.0A 2024-01-08 2024-01-08 Deep learning-based efficient PDF application program fuzzy test case generation method Pending CN117873888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410025515.0A CN117873888A (en) 2024-01-08 2024-01-08 Deep learning-based efficient PDF application program fuzzy test case generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410025515.0A CN117873888A (en) 2024-01-08 2024-01-08 Deep learning-based efficient PDF application program fuzzy test case generation method

Publications (1)

Publication Number Publication Date
CN117873888A true CN117873888A (en) 2024-04-12

Family

ID=90584153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410025515.0A Pending CN117873888A (en) 2024-01-08 2024-01-08 Deep learning-based efficient PDF application program fuzzy test case generation method

Country Status (1)

Country Link
CN (1) CN117873888A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118277839A (en) * 2024-06-03 2024-07-02 贵州大学 BCTGAN data expansion method for extremely unbalanced data fault diagnosis
CN118277839B (en) * 2024-06-03 2024-07-26 贵州大学 BCTGAN data expansion method for extremely unbalanced data fault diagnosis

Similar Documents

Publication Publication Date Title
US11269622B2 (en) Methods, systems, articles of manufacture, and apparatus for a context and complexity-aware recommendation system for improved software development efficiency
US11157384B2 (en) Methods, systems, articles of manufacture and apparatus for code review assistance for dynamically typed languages
Ouni et al. Maintainability defects detection and correction: a multi-objective approach
CN108595341B (en) Automatic example generation method and system
US6282527B1 (en) Adaptive problem solving method and apparatus utilizing evolutionary computation techniques
US20080282108A1 (en) Program synthesis and debugging using machine learning techniques
CN108228469B (en) Test case selection method and device
CN117873888A (en) Deep learning-based efficient PDF application program fuzzy test case generation method
US8732676B1 (en) System and method for generating unit test based on recorded execution paths
US12026260B2 (en) Deep-learning based device and method for detecting source-code vulnerability with improved robustness
CN113591093B (en) Industrial software vulnerability detection method based on self-attention mechanism
CN113076545A (en) Deep learning-based kernel fuzzy test sequence generation method
CN116663019B (en) Source code vulnerability detection method, device and system
Xu et al. Tree2tree structural language modeling for compiler fuzzing
CN116702157A (en) Intelligent contract vulnerability detection method based on neural network
CN114707151B (en) Zombie software detection method based on API call and network behavior
CN116361788A (en) Binary software vulnerability prediction method based on machine learning
CN110221838B (en) Method for carrying out automatic program design optimization based on genetic algorithm and directed acyclic graph
CN114445656A (en) Multi-label model processing method and device, electronic equipment and storage medium
CN112231650B (en) Data privacy protection protocol analysis method and device and electronic equipment
Venugopal et al. Use of genetic algorithms in software testing models
Cody-Kenny et al. The emergence of useful bias in self-focusing genetic programming for software optimisation
CN113656669A (en) Label updating method and device
Zhang et al. Reducing Test Cases with Attention Mechanism of Neural Networks
CN117290856B (en) Intelligent test management system based on software automation test technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination