CN116226673A

CN116226673A - Training method of buffer region vulnerability recognition model, vulnerability detection method and device

Info

Publication number: CN116226673A
Application number: CN202310490897.XA
Authority: CN
Inventors: 杨星; 纪守领; 吴志勇; 张旭鸿; 刘沛宇; 梁振宇; 许颢砾; 胡书隆; 沈传宝; 候晓雄; 刘加瑞
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2023-05-05
Filing date: 2023-05-05
Publication date: 2023-06-06
Anticipated expiration: 2043-05-05
Also published as: CN116226673B

Abstract

The application provides a training method for a buffer vulnerability identification model, a vulnerability detection method, and a device. The training method comprises the following steps: collecting original software samples and corresponding original labels; determining a buffer overflow attribute of each original software sample and marking the corresponding attribute value; performing code representation on each original software sample and mapping the code representation, according to the attribute value, into a multidimensional vector, referred to as a code multidimensional vector; dividing the code multidimensional vectors of the original software samples and the corresponding original labels into a training set and a test set; training the buffer vulnerability identification model using the training set; and evaluating and/or optimizing the trained buffer vulnerability identification model using the test set. In this way, the buffer vulnerability identification model can accurately identify software features that carry vulnerability risk, the accuracy and efficiency of vulnerability risk identification are improved, and the requirements of vulnerability feature analysis and detection are met.

Description

Training method of buffer region vulnerability recognition model, vulnerability detection method and device

Technical Field

The present disclosure relates to the field of network security technologies, and in particular, to a method and apparatus for training a buffer vulnerability identification model, and a vulnerability detection method.

Background

At present, networks are rife with software vulnerabilities, so network risk keeps increasing. Against this background, software vulnerabilities must be detected in order to prevent malicious attacks, such as theft or modification, on user and application data in the network.

Software vulnerability mining is one of the main means of checking for and discovering security vulnerabilities in software systems. It mainly searches for design errors, coding defects and runtime faults by auditing the software's code with various tools or by analyzing the software's execution. Early vulnerability mining techniques are classified into static analysis methods and dynamic analysis methods, depending on whether they rely on running the program. However, static analysis methods are less accurate and dynamic analysis methods are less efficient.

Disclosure of Invention

The application provides a training method, a vulnerability detection method and a device of a buffer vulnerability identification model.

According to a first aspect of the present application, a training method of a buffer vulnerability identification model is provided. The method comprises the following steps:

collecting an original software sample and a corresponding original label;

determining buffer overflow attribute of each original software sample, and marking corresponding attribute value;

performing code representation on each original software sample, and mapping the code representation into a multidimensional vector, namely a code multidimensional vector;

dividing the code multidimensional vector of each original software sample and the corresponding original label into a training set and a testing set;

training the buffer vulnerability recognition model by using the training set;

evaluating and/or optimizing the trained buffer zone vulnerability identification model by using a test set;

wherein, the code representation of each original software sample comprises:

extracting a receiver type feature A, a memory location feature B and a container feature C from an original software sample, respectively encoding the receiver type feature, the memory location feature and the container feature, and splicing the encoded receiver type feature, the encoded memory location feature and the encoded container feature to form a code representation of the original software sample;

the mapping the code representation into a multidimensional vector, which is a code multidimensional vector, comprises:

taking the maximum dimension value in the receiver type characteristic corresponding to each original software sample as a receiver type alignment dimension value A1; taking the maximum dimension value in the memory location characteristic corresponding to each original software sample as a memory location alignment dimension value B1; taking the maximum value of the dimension in the container characteristics corresponding to each original software sample as a container alignment dimension value C1; the dimensions of the receiver type feature, the memory location feature and the container feature corresponding to each original software sample are respectively expanded into A1, B1 and C1 in a way of supplementing 0 on the right side, namely an aligned receiver type feature A2, an aligned memory location feature B2 and an aligned container feature C2 corresponding to each original software sample are respectively formed; and taking the aligned receiver type characteristic A2, the aligned memory location characteristic B2 and the aligned container characteristic C2 corresponding to each original software sample as code multidimensional vectors.
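The alignment step above can be illustrated with a minimal Python sketch. The dictionary keys, the helper names `pad_right` and `align_samples`, and the sample values are hypothetical. Concatenating A2, B2 and C2 into a single vector is also an assumption, since the text only states that the three aligned features together serve as the code multidimensional vector.

```python
from typing import Dict, List

Feature = List[int]
Sample = Dict[str, Feature]  # hypothetical keys: "receiver", "memory", "container"


def pad_right(vec: Feature, width: int) -> Feature:
    """Expand vec to the given width by supplementing zeros on the right."""
    return vec + [0] * (width - len(vec))


def align_samples(samples: List[Sample]) -> List[Feature]:
    keys = ("receiver", "memory", "container")
    # A1, B1, C1: the maximum dimension of each feature over all samples.
    widths = {k: max(len(s[k]) for s in samples) for k in keys}
    aligned = []
    for s in samples:
        row: Feature = []
        for k in keys:
            # A2, B2, C2: each feature padded to its alignment dimension.
            row.extend(pad_right(s[k], widths[k]))
        aligned.append(row)
    return aligned


# Two toy samples with encoded features of different lengths.
samples = [
    {"receiver": [3, 1], "memory": [7, 2, 9, 4], "container": [5, 1]},
    {"receiver": [4, 2, 8, 3], "memory": [6, 1], "container": []},
]
aligned_vectors = align_samples(samples)
```

After alignment every sample has the same total dimension (A1 + B1 + C1), which is what a fixed-input model typically requires.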

Preferably, the receiver type feature is used to characterize that the original software sample contains one or more of the following three code types, namely: a pointer indirect reference, a dangerous function, and a function whose similarity to a dangerous function is greater than a preset threshold; the receiver type feature is obtained as follows: performing lexical analysis on the original software sample to obtain the code containing pointer indirect references, dangerous functions and functions whose similarity to a dangerous function is greater than the preset threshold, taking the position at which each piece of code occurs and its code type as a data pair, and splicing the data pairs; the dangerous function is a user-defined function that implements the same functionality as a standard library function;

the memory location feature is used to characterize the presence of one or more of the following five cases in the original software sample, namely: the memory location is in a stack buffer, the memory location is in a heap buffer, the memory location is in a data segment, the memory location is in a BSS segment, and the memory location is in shared memory; the memory location feature is obtained as follows: performing syntax analysis on the original software sample to obtain the code in which the memory location is in a stack buffer, a heap buffer, a data segment, a BSS segment or shared memory, taking the position at which the code occurs and the type of object at the memory location as a data pair, and splicing all the data pairs;

the container feature is used to characterize which of three container types is defined in the original software sample, namely: Struct, Union, None; the container feature is obtained as follows: performing syntax analysis on the original software sample to obtain the code that defines a container of type Struct, Union or None, taking the position at which the code occurs and the container type as a data pair, and splicing all the data pairs.
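As a rough illustration of how (position, type) data pairs might be collected and spliced, the following Python sketch handles only the container feature. It is a sketch under stated assumptions: a regular expression stands in for real syntax analysis, the line number is used as the position, and the integer codes in `CONTAINER_CODES` are hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical integer codes for the three container types.
CONTAINER_CODES = {"none": 0, "struct": 1, "union": 2}

# Crude stand-in for syntax analysis: matches "struct name {" or "union name {".
_DEFINITION = re.compile(r"\b(struct|union)\s+\w*\s*\{")


def container_pairs(source: str) -> List[Tuple[int, int]]:
    """Return (line number, container code) pairs found in C source text."""
    pairs = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in _DEFINITION.finditer(line):
            pairs.append((line_no, CONTAINER_CODES[match.group(1)]))
    if not pairs:
        # No container definition found: record the "None" type (assumed position 0).
        pairs.append((0, CONTAINER_CODES["none"]))
    return pairs


def splice(pairs: List[Tuple[int, int]]) -> List[int]:
    """Splice all data pairs into one flat feature vector."""
    return [value for pair in pairs for value in pair]


c_source = """\
struct packet {
    char payload[16];
};
union value { int i; char raw[4]; };
"""
container_feature = splice(container_pairs(c_source))
```

The receiver type and memory location features could be collected with the same pair-and-splice pattern, using their own type codes.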

Preferably, marking the corresponding attribute value includes:

marking the buffer overflow attribute as 1 when it indicates a buffer overflow, and as 0 when it indicates no buffer overflow.

Preferably, the training the buffer vulnerability recognition model using the training set includes:

inputting the code multidimensional vectors and original labels of the training set samples into the buffer vulnerability recognition model, constructing a loss function from the difference between the input original labels and the labels output by the model, minimizing the loss function with an optimization method, and adjusting the parameters of the model according to the minimized loss function.

Preferably, the evaluating and/or optimizing the trained buffer vulnerability recognition model by using the test set includes:

inputting the code multidimensional vectors of the test set samples into the buffer vulnerability recognition model, calculating one or more of the accuracy, recall and F-measure of the model from the original labels of the test set samples and the labels output by the model, and evaluating and/or optimizing the model according to one or more of the accuracy, recall and F-measure.

According to a second aspect of the present application, there is provided a vulnerability detection method based on a buffer vulnerability identification model, including:

inputting the code multidimensional vector of the software to be detected into a buffer vulnerability recognition model trained by the above training method, and determining whether the software has a vulnerability and/or the vulnerability type according to the label output by the model.

According to a third aspect of the present application, there is provided a training apparatus for a buffer vulnerability recognition model, including:

the data acquisition unit is used for acquiring an original software sample and a corresponding original label;

the attribute marking unit is used for determining the buffer overflow attribute of each original software sample and marking the corresponding attribute value;

the code mapping unit is used for carrying out code representation on each original software sample, mapping the code representation into a multidimensional vector and obtaining a code multidimensional vector;

the grouping unit is used for dividing the code multidimensional vector of each original software sample and the corresponding original label into a training set and a testing set;

the training unit is used for training the buffer vulnerability identification model by utilizing the training set;

the evaluation unit is used for evaluating and/or optimizing the trained buffer zone vulnerability identification model by using the test set;

wherein, the code representation of each original software sample comprises:

According to a fourth aspect of the present application, there is provided a vulnerability detection apparatus based on a buffer vulnerability identification model, the apparatus comprising:

the input unit is used for inputting the code multidimensional vector of the software to be detected into the buffer zone vulnerability recognition model obtained by training by adopting the training method of the buffer zone vulnerability recognition model;

and the test unit is used for determining whether the software has a vulnerability and/or the vulnerability type according to the label output by the buffer vulnerability identification model.

According to the method and the device, samples are marked according to the buffer overflow attribute, and the buffer vulnerability identification model is trained on the marking result and the code representation, so that the model can accurately identify software features that carry vulnerability risk, the accuracy and efficiency of vulnerability risk identification are improved, and the requirements of vulnerability feature analysis and detection are met.

It should be understood that the description in this summary is not intended to identify key or critical features of embodiments of the present application, nor is it intended to limit the scope of the present application. Other features of the present application will become apparent from the description that follows.

Drawings

The above and other features, advantages and aspects of embodiments of the present application will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. For a better understanding of the present disclosure, and without limiting the present disclosure, reference is made to the accompanying drawings, in which the same or similar reference numerals designate the same or similar elements, and wherein:

FIG. 1 is a flowchart of a training method of a buffer vulnerability recognition model provided in an embodiment of the present application;

FIG. 2 is a flowchart of a method for detecting vulnerabilities based on a buffer vulnerability recognition model according to an embodiment of the present application;

FIG. 3 is a block diagram of a training device for a buffer vulnerability recognition model according to an embodiment of the present application;

FIG. 4 is a block diagram of a vulnerability detection device based on a buffer vulnerability identification model according to an embodiment of the present application;

FIG. 5 is a block diagram of an exemplary electronic device provided by an embodiment of the present application.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.

In addition, the term "and/or" herein merely describes an association between associated objects and means that three relations may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" herein generally indicates an "or" relationship between the objects before and after it.

Software vulnerability identification is one of the main means of checking for and discovering security vulnerabilities in software systems. It mainly audits the software's code with various tools, or analyzes the software's execution, to find design errors, coding defects and runtime faults. Early vulnerability discovery techniques are classified into static analysis methods and dynamic analysis methods, depending on whether they rely on running the program.

In recent years, the increase in software complexity has presented a significant challenge to software security. As software scale keeps growing and vulnerability forms diversify, traditional buffer overflow vulnerability mining methods can no longer meet the security analysis requirements of complex software, because of their high false positive rate and high false negative rate.

Traditional static analysis methods often rely on human experts to construct vulnerability patterns. As software complexity increases, the cost of manual construction becomes too high, and human subjectivity affects the false positive rate and the false negative rate. Some open-source static analysis tools often generate too many false positives, which reduces the accuracy of vulnerability identification.

In dynamic analysis methods, dynamic test generation typically involves fuzz testing and symbolic execution. To detect vulnerabilities, fuzz testing randomly generates test cases to trigger program failures, while symbolic execution gathers constraints as it traverses program paths and uses a constraint solver to generate the relevant test cases. Fuzz testing cannot understand a program thoroughly and comprehensively, and symbolic execution suffers from path explosion, constraint solving and memory modeling problems, so dynamic analysis methods alone still have great difficulty identifying vulnerabilities in large-scale software systems.

The application defines a machine-learning-based buffer overflow vulnerability identification method, which identifies the various buffer overflow vulnerabilities present in computer software or systems by extracting features from samples of malicious source code, training on them, and generating a model.

Fig. 1 is a flowchart of a training method 100 of a buffer vulnerability recognition model according to an embodiment of the present application.

As shown in fig. 1, the training method 100 of the buffer vulnerability recognition model includes:

S101, collecting an original software sample and a corresponding original label;

S102, determining a buffer overflow attribute of each original software sample, and marking the corresponding attribute value;

S103, performing code representation on each original software sample, and mapping the code representation into a multidimensional vector, namely a code multidimensional vector;

S104, dividing the code multidimensional vector of each original software sample and the corresponding original label into a training set and a testing set;

S105, training the buffer vulnerability recognition model by using the training set;

S106, evaluating and/or optimizing the trained buffer vulnerability recognition model by using the test set;

wherein, the code representation of each original software sample comprises:

Further, the receiver type feature is used to characterize that the original software sample contains one or more of the following three code types: a pointer indirect reference, a dangerous function, and a function whose similarity to a dangerous function is greater than a preset threshold; the receiver type feature is obtained as follows: performing lexical analysis on the original software sample to obtain the code containing pointer indirect references, dangerous functions and functions whose similarity to a dangerous function is greater than the preset threshold, taking the position at which each piece of code occurs and its code type as a data pair, and splicing the data pairs; the dangerous function is a user-defined function that implements the same functionality as a standard library function;

In S101, collecting the original software samples includes collecting them from a software dataset. The software dataset annotates each original software sample with an original label, which indicates whether the sample has a vulnerability, or the type of the sample's vulnerability.

Correspondingly, the training method 100 of the buffer vulnerability recognition model described in the present application can train the model to recognize whether software has a buffer vulnerability, or to recognize the type of the software's buffer overflow vulnerability.

In S102, the determining a buffer overflow attribute of the original software sample includes:

if the training method is used to train the model to identify whether software has a buffer vulnerability, the buffer overflow attributes are "buffer overflow" and "no buffer overflow", with attribute values 1 and 0 respectively; if the training method is used to train the model to identify the type of the software's buffer vulnerability, the buffer overflow attribute comprises features related to the buffer overflow vulnerability type.

In some embodiments, the buffer overflow attribute includes one or more of a receiver type feature, a memory location feature, a container feature.

Obviously, a receiver anomaly may cause the buffer boundary to overflow; the memory location feature indicates the location of the receiving buffer; and the container feature describes the structure that encapsulates the receiving buffer. In general, the more complex the container structure, the more vulnerable the buffer. An analysis of multiple groups of buffer overflow vulnerabilities in the present application shows that nearly 30% of them are container-related, so the container of the buffer is taken as one of the buffer overflow attributes.

According to the embodiment of the application, the buffer overflow attribute is set according to the features related to the buffer overflow vulnerability types, so that the buffer vulnerability identification model can be trained to classify software vulnerabilities, distinguish their types, and support targeted repair of each vulnerability.

In some embodiments, the receiver type feature includes one or more of: whether there is a pointer indirect reference, whether there is a dangerous function, and whether there is a function similar to a dangerous function;

Without proper buffer boundary checking, a buffer overflow may occur if a statement belongs to one of three receiver types: pointer indirect reference, array write, or dangerous function.

Wherein: pointer indirect reference: when an application dereferences a pointer that it expects to be valid but that is actually null, the result is typically a crash or exit.

Array write: a standard library call or user-defined call that copies into or fills a buffer. Formatted string output may also lead to potential buffer overflows.

Dangerous function: besides standard library function calls, some user-defined functions have the same effect. The present application classifies user-defined functions that have a similar function name and exactly the same number of parameters as dangerous functions.
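The dangerous-function rule (similar name, identical parameter count) can be sketched in Python as follows. The reference table `STD_FUNCTIONS`, the 0.7 threshold and the use of `difflib.SequenceMatcher` as the similarity measure are illustrative assumptions; the application does not specify which similarity measure or threshold is used.

```python
from difflib import SequenceMatcher

# Hypothetical reference table: standard library function name -> parameter count.
STD_FUNCTIONS = {"strcpy": 2, "strcat": 2, "memcpy": 3, "gets": 1}


def is_dangerous(name: str, param_count: int, threshold: float = 0.7) -> bool:
    """Flag a user-defined function whose name is similar to a standard library
    function and whose parameter count is exactly the same."""
    for std_name, std_params in STD_FUNCTIONS.items():
        if param_count != std_params:
            continue  # the parameter count must match exactly
        similarity = SequenceMatcher(None, name.lower(), std_name).ratio()
        if similarity >= threshold:
            return True
    return False


# A user-defined wrapper with a strcpy-like name and two parameters is flagged.
flag_wrapper = is_dangerous("my_strcpy", 2)
# The same name with three parameters matches no table entry closely enough.
flag_wrong_arity = is_dangerous("my_strcpy", 3)
# An unrelated name is not flagged.
flag_unrelated = is_dangerous("render_frame", 2)
```

Requiring an exact parameter-count match keeps the name comparison from flagging functions that merely share a common substring with a library routine.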

Or, the memory location feature includes whether the memory location is in one or more of a stack buffer, a heap buffer, a data segment, a BSS segment, and shared memory;

wherein the stack buffer describes a locally and non-statically defined array;

the heap buffer describes dynamically allocated memory used to satisfy memory allocation requests;

the data segment describes a static variable or a global variable;

BSS segment describes uninitialized global or static variables;

shared memory describes a method of inter-process communication (IPC).

Or, the container feature includes whether the container is one of Struct, Union, None, or one of Struct, Union, None, Others.

According to the embodiment of the application, the buffer overflow attribute is determined from multiple directions so as to cover more buffer overflow vulnerability types, and the model is convenient to recognize and train.

In S102, the attribute value corresponding to the flag includes:

According to the embodiment of the application, the attribute value is marked on the buffer overflow attribute, the buffer overflow attribute is quantized, and the machine can learn in a targeted mode.

In another embodiment of this application, source code related to software vulnerabilities is extracted from the original software samples, and text-format data such as the extracted source code, GitHub commit information and other software program data is retained. From the processed samples, the corresponding code representation is extracted according to code representation forms such as code metrics, token sequences, abstract syntax trees and graphs. The code tokens are then mapped into vectors in order, yielding the corresponding code multidimensional vectors.

According to the embodiment of the application, the codes are mapped into the code multidimensional vectors, so that the code representation and the attribute of the original software sample are combined, and the machine can learn and identify the vulnerability type according to the characteristics of the original software sample.

In S105, training the buffer vulnerability recognition model using the training set includes:

The loss function is a function that measures the degree of difference between the model's predicted value and the true value; the smaller the loss function is, the better the robustness of the model is. The optimization methods mainly include gradient descent, stochastic gradient descent and batch gradient descent.

According to the embodiment of the application, the loss function is minimized through the optimization method, so that parameters of the model are continuously adjusted, robustness of the buffer zone vulnerability recognition model is improved through training for a plurality of times, and recognition accuracy of the model obtained through training is higher.
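The training loop described above (build a loss from the gap between original and predicted labels, then minimize it by gradient descent) can be sketched in Python. The model here is a plain logistic regression with cross-entropy loss and full-batch gradient descent. This is an assumption made purely for illustration: the application does not state the model architecture, the loss or the hyperparameters, and the toy vectors are invented.

```python
import math
from typing import List, Tuple

Vector = List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(x: Vector, weights: Vector, bias: float) -> float:
    """Predicted probability that the sample has a buffer overflow (label 1)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def mean_loss(vectors: List[Vector], labels: List[int], weights: Vector, bias: float) -> float:
    """Cross-entropy between the original labels and the model outputs."""
    total = 0.0
    for x, y in zip(vectors, labels):
        p = min(max(score(x, weights, bias), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(vectors)


def train(vectors: List[Vector], labels: List[int], lr: float = 0.5,
          epochs: int = 500) -> Tuple[Vector, float]:
    """Full-batch gradient descent on the cross-entropy loss."""
    weights, bias, n = [0.0] * len(vectors[0]), 0.0, len(vectors)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(vectors, labels):
            err = score(x, weights, bias) - y  # gradient of the loss w.r.t. the logit
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        # Adjust the parameters in the direction that reduces the loss.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(x: Vector, weights: Vector, bias: float) -> int:
    return 1 if score(x, weights, bias) >= 0.5 else 0


# Toy code multidimensional vectors and original labels (1 = buffer overflow).
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
labels = [1, 1, 0, 0]
initial_loss = mean_loss(vectors, labels, [0.0, 0.0], 0.0)
weights, bias = train(vectors, labels)
final_loss = mean_loss(vectors, labels, weights, bias)
```

Stochastic gradient descent would differ only in updating the parameters after each sample (or small batch) rather than after a full pass over the training set.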

In S106, the evaluating and/or optimizing the trained buffer vulnerability recognition model using the test set includes:

inputting the code multidimensional vectors of the test set samples into the buffer vulnerability recognition model, calculating one or more of the accuracy, recall and F-measure of the model from the original labels of the test set samples and the labels output by the model, and evaluating and/or optimizing the buffer vulnerability recognition model according to one or more of the accuracy, recall and F-measure.

Tuning the model includes fine-tuning the parameters of the model.
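For reference, the evaluation metrics can be computed from the test set's original labels and the model's output labels as in the Python sketch below. It assumes binary labels (1 = buffer overflow) and takes the F-measure to be the balanced F1 score. Precision is reported as well, because F1 is defined from precision and recall; the function name `evaluate` and the toy labels are hypothetical.

```python
from typing import Dict, List


def evaluate(true_labels: List[int], predicted_labels: List[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy test set: original labels versus labels output by the model.
metrics = evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Recall is usually the metric to watch for vulnerability detection, since a missed vulnerability (false negative) tends to cost more than a false alarm.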

According to the embodiment of the application, the buffer vulnerability identification model is evaluated and/or optimized according to the test set sample, and the capability of the buffer vulnerability identification model is known, so that a user can know the accuracy of software vulnerability identification to a certain extent.

Fig. 2 is a flowchart of a method 200 for detecting vulnerabilities based on a buffer vulnerability recognition model according to an embodiment of the present application.

As shown in fig. 2, the vulnerability detection method 200 based on the buffer vulnerability identification model includes:

S201, inputting the code multidimensional vector of the software to be detected into the buffer vulnerability recognition model trained by the above training method of the buffer vulnerability recognition model;

S202, determining whether the software has a vulnerability and/or the vulnerability type according to the label output by the buffer vulnerability identification model.

In some embodiments, the method further comprises updating the training set samples, including:

when the buffer vulnerability identification model cannot output a label, marking the corresponding buffer overflow attribute, and storing the corresponding code multidimensional vector, buffer overflow attribute and label in the training set.

It can be appreciated that when the buffer vulnerability recognition model cannot output a label, the corresponding code multidimensional vector evidently represents new code that the model has not been trained on; the new code is stored in the training set so that the model can undergo supplementary training for the corresponding buffer vulnerability type.

According to the embodiment of the application, updating the training set samples from time to time makes it possible to identify buffer overflow features that have not appeared before; at the same time, because the training set is updated and stored, the update speed of the buffer vulnerability identification model keeps pace with the speed at which buffer overflow vulnerabilities evolve, so that whether software has a vulnerability can be detected quickly and effectively.

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required in the present application.

The foregoing is a description of embodiments of the method, and the following further describes embodiments of the device.

Fig. 3 is a block diagram of a training apparatus 300 of a buffer vulnerability recognition model according to an embodiment of the present application.

As shown in fig. 3, the training apparatus 300 of the buffer vulnerability recognition model includes:

a data acquisition unit 301, configured to acquire an original software sample and a corresponding original label;

an attribute marking unit 302, configured to determine a buffer overflow attribute of each original software sample and mark the corresponding attribute value;

a code mapping unit 303, configured to perform code representation on each original software sample and map the code representation into a multidimensional vector to obtain a code multidimensional vector;

a grouping unit 304, configured to divide the code multidimensional vector and the corresponding original label of each original software sample into a training set and a testing set;

a training unit 305, configured to train the buffer vulnerability identification model by using the training set;

an evaluation unit 306, configured to evaluate and/or optimize the trained buffer vulnerability identification model by using the test set;

wherein, the code representation of each original software sample comprises:

Fig. 4 is a block diagram of a vulnerability detection apparatus 400 based on a buffer vulnerability recognition model according to an embodiment of the present application.

As shown in fig. 4, the vulnerability detection apparatus 400 based on the buffer vulnerability recognition model includes:

an input unit 401, configured to input the code multidimensional vector of the software to be detected into the buffer vulnerability recognition model trained by the above training method of the buffer vulnerability recognition model;

and a test unit 402, configured to determine whether the software has a vulnerability and/or the vulnerability type according to the label output by the buffer vulnerability recognition model.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the described modules may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.

In the technical solution of the application, the acquisition, storage, use and other handling of any user personal information involved comply with the relevant laws and regulations and do not violate public order and good customs.

According to embodiments of the present application, there is also provided an electronic device, a readable storage medium and a computer program product.

Fig. 5 shows a schematic block diagram of an electronic device 500 that may be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.

The device 500 comprises a computing unit 501 that may perform various suitable actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 502 or loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, ROM 502, and RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.

Various components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, etc.; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508 such as a magnetic disk, an optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 501 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the various methods and processes described above, such as method 100 or 200. For example, in some embodiments, the method 100 or 200 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into RAM 503 and executed by computing unit 501, one or more steps of method 100 or 200 described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the method 100 or 200 by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SoCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present application may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of the flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solutions of the present application are achieved; no limitation is imposed herein.

The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims

1. The training method of the buffer vulnerability recognition model is characterized by comprising the following steps of:

collecting an original software sample and a corresponding original label;

training the buffer vulnerability recognition model by using the training set;

wherein, the code representation of each original software sample comprises:

2. The method of training a buffer vulnerability identification model of claim 1, wherein the receiver type features are used to characterize one or more of the following three code types in the original software sample: code in which a pointer indirect reference exists, code in which a dangerous function exists, and code in which a function whose similarity to a dangerous function is greater than a preset threshold exists; the receiver type features are acquired as follows: performing lexical analysis on the original software sample to obtain the code containing pointer indirect references, dangerous functions, and functions whose similarity to a dangerous function is greater than the preset threshold, taking the occurrence position of each such code and its type as a data pair, and splicing the data pairs; the dangerous function is a user-defined function that implements the same functionality as a standard library function;

3. The method for training a buffer vulnerability recognition model according to claim 2, wherein the attribute values corresponding to the markers include:

4. The method for training the buffer vulnerability identification model of claim 2, wherein training the buffer vulnerability identification model using the training set comprises:

5. The method for training the buffer vulnerability recognition model according to claim 2, wherein the evaluating and/or optimizing the trained buffer vulnerability recognition model by using the test set comprises:

6. A vulnerability detection method based on a buffer vulnerability recognition model is characterized by comprising the following steps:

inputting a code multidimensional vector of software to be detected into a buffer vulnerability identification model obtained by training with the training method of the buffer vulnerability identification model according to any one of claims 1-5, and determining whether the software has a vulnerability and/or the vulnerability type according to a label output by the buffer vulnerability identification model.

7. A training device for a buffer vulnerability recognition model, comprising:

wherein, the code representation of each original software sample comprises:

8. A vulnerability detection apparatus based on a buffer vulnerability recognition model, comprising:

an input unit for inputting a code multidimensional vector of software to be detected into a buffer vulnerability recognition model obtained by training the buffer vulnerability recognition model by the training method of any one of claims 1-5;