CN114064117A - Code clone detection method and system based on byte code and neural network


Info

Publication number
CN114064117A
CN114064117A (application CN202111400977.9A)
Authority
CN
China
Prior art keywords
data
code
model
training
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111400977.9A
Other languages
Chinese (zh)
Inventor
万邦睿
董双
黄江平
钱鹰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202111400977.9A priority Critical patent/CN114064117A/en
Publication of CN114064117A publication Critical patent/CN114064117A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 - Arrangements for software engineering
    • G06F8/70 - Software maintenance or management
    • G06F8/75 - Structural analysis for program understanding
    • G06F8/751 - Code clone detection
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods


Abstract

The invention belongs to the technical field of code clone detection, and in particular relates to a code clone detection method and system based on byte code and a neural network. The method comprises: acquiring code data to be detected, inputting it into a trained code clone detection model to obtain a detection result, and marking and storing the detection result. Compared with existing text-based and lexical (Token)-based detection methods, the method fully considers code semantic information and can improve the detection of type 3 and type 4 clones in terms of precision, recall and F1 measure.

Description

Code clone detection method and system based on byte code and neural network
Technical Field
The invention belongs to the technical field of code clone detection, and particularly relates to a code clone detection method and system based on byte codes and a neural network.
Background
A code clone, also known as duplicate or similar code, refers to two or more identical or similar source code fragments in a code base. Code clone detection is a difficult problem in the field of software engineering. Code clones are divided into four types: Type-1, Type-2, Type-3 and Type-4 clones. A Type-1 clone is a pair of code fragments that are identical except for whitespace and comments; a Type-2 clone is, on the basis of Type-1, identical except for variable names, type names and function names; in a Type-3 clone, on the basis of Type-2, the two code fragments have similar structures but program statements have been added, deleted or modified; a Type-4 clone refers to syntactically different code with the same functionality.
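The first two clone types can be illustrated with simple normalization rules; the Java snippets and the deliberately minimal keyword list below are illustrative, not part of the invention:

```python
import re

def strip_comments_ws(code: str) -> str:
    """Remove // comments and collapse whitespace (Type-1 normalization)."""
    code = re.sub(r"//[^\n]*", "", code)
    return re.sub(r"\s+", " ", code).strip()

def normalize_identifiers(code: str) -> str:
    """Replace every identifier with a placeholder (Type-2 normalization).
    Keywords are kept so the structure is preserved; this keyword list is
    deliberately minimal for the sketch."""
    keywords = {"int", "return", "for", "if"}
    return re.sub(r"\b[a-zA-Z_]\w*\b",
                  lambda m: m.group(0) if m.group(0) in keywords else "ID",
                  code)

a = "int add(int x, int y) { return x + y; }  // sum"
b = "int add(int x,\n        int y) { return x + y; }"
c = "int plus(int a, int b) { return a + b; }"

# a and b differ only in whitespace/comments: a Type-1 clone pair
assert strip_comments_ws(a) == strip_comments_ws(b)
# a and c differ only in identifier names: a Type-2 clone pair
assert normalize_identifiers(strip_comments_ws(a)) == \
       normalize_identifiers(strip_comments_ws(c))
```

Type-3 and Type-4 clones cannot be decided by such normalization alone, which is why the patent turns to learned semantic representations.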
For the above four clone categories, the current mainstream detection methods can be classified according to different code representation modes: text-based detection methods, lexical (Token) -based detection methods, grammar-based detection methods, and semantic-based detection methods. However, the above method has the following disadvantages:
1. For text-based and lexical detection methods, the precision, recall and F1 measure on type 3 and type 4 clones are too low.
2. The detection method based on grammar and semantics usually needs to construct an abstract syntax tree, a program dependency graph and the like, and the space-time complexity is high.
3. In some scenarios (for example, a bytecode file produced with code obfuscation), the bytecode file cannot be decompiled back into source code. In this case, comparison methods based on source code cannot detect potential code clones.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a code clone detection method based on byte codes and a neural network, which comprises the following steps: acquiring code data to be detected, inputting the code data to be detected into a trained code clone detection model to obtain a detection result, and marking and storing the detection result;
the process of training the code clone detection model comprises the following steps:
s1: acquiring an original code data set; preprocessing the code data to obtain a formatted data set;
s2: dividing the formatted data set to obtain a training set, a verification set and a test set;
s3: extracting features of the data in the training set to obtain features of the code data, and collecting the features of the code data to obtain a matrix pair set;
s4: inputting the matrix pair set into a neural network to obtain a detection result;
s5: calculating a loss function value of the model according to the calculated result, inputting the validation set data into the model, continuously adjusting the parameters of the model, and finishing the training when the loss function reaches its minimum;
s6: and inputting the test set into a trained code clone detection model, and evaluating the model.
Preferably, the preprocessing of the code data includes: cleaning original code data and deleting redundant data; carrying out format conversion on the cleaned data to obtain a formatted data set; each piece of data in the formatted data set comprises a pair of bytecode instruction sequences and a class label.
Preferably, the process of extracting features from the data in the training set includes:
step 1: establishing a byte code word vector corpus;
step 2: constructing a custom word vector model, inputting a byte code word vector corpus into the custom word vector model for training to obtain a trained word vector model;
step 3: inputting the byte code instruction sequence pairs in the data set into the trained word vector model to obtain a matrix pair set formed by combining the word vectors.
Further, the process of establishing the byte code word vector corpus includes:
step 11: selecting a code base;
step 12: compiling the selected code library to obtain a class file library;
step 13: converting the class file library into a byte code instruction file library;
step 14: extracting the operation code sequences of all functions in each bytecode instruction file as the word vector training corpus, wherein the operation code sequence of each function serves as one piece of training data.
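The opcode extraction of step 14 can be sketched as follows. The sample `javap -c` fragment and the regular expression are simplified assumptions; real `javap` output contains more detail (constant-pool comments, exception tables, etc.):

```python
import re

# A simplified fragment of `javap -c` output for one function (assumed shape).
JAVAP_OUTPUT = """
  public int add(int, int);
    Code:
       0: iload_1
       1: iload_2
       2: iadd
       3: ireturn
"""

# Each bytecode instruction line looks like "offset: opcode [operands]".
INSN = re.compile(r"^\s*\d+:\s+([a-z_0-9]+)", re.MULTILINE)

def extract_opcode_sequence(javap_text: str) -> list[str]:
    """Return the opcode sequence of one function, i.e. one training sample."""
    return INSN.findall(javap_text)

print(extract_opcode_sequence(JAVAP_OUTPUT))
# → ['iload_1', 'iload_2', 'iadd', 'ireturn']
```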
Further, the process of training the custom word vector model includes:
step 21: constructing a word vector model;
step 22: setting key parameters of a word vector model; the key parameters comprise the times of training, the vector shape and the size of a sliding window;
step 23: training the word vector model according to the set key parameters and the byte code word vector corpus to obtain the trained word vector model.
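The embodiment described later uses the word2vec CBOW model, which predicts a target opcode from its surrounding context window. How (context, target) training pairs would be derived from one opcode sequence can be sketched as follows; the window size is reduced to 2 for readability (the embodiment uses 10):

```python
def cbow_pairs(sequence, window=2):
    """Generate (context, target) pairs for CBOW training: each opcode is
    predicted from up to `window` neighbors on each side."""
    pairs = []
    for i, target in enumerate(sequence):
        context = sequence[max(0, i - window):i] + sequence[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

seq = ["iload_1", "iload_2", "iadd", "ireturn"]
for ctx, tgt in cbow_pairs(seq):
    print(ctx, "->", tgt)
```

In practice the training itself would be delegated to a library such as gensim; this sketch only shows what the model is trained on.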
Preferably, processing the matrix pair set with the neural network comprises: inputting the matrix pairs of the matrix pair set into the neural network and obtaining predicted values through forward propagation.
Preferably, the loss function expression of the model is:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\right]$$

wherein N represents the total number of samples, $y_i$ represents the actual label of the i-th sample, and $\hat{y}_i$ represents the predicted value of the i-th sample.
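This is the standard binary cross-entropy; a direct stdlib implementation, independent of any deep-learning framework, is a one-liner per sample:

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clamp predictions into (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n

# Near-perfect predictions give a loss near 0; a maximally uncertain
# prediction of 0.5 gives exactly ln 2.
print(binary_cross_entropy([1, 0], [0.99, 0.01]))
```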
A bytecode-and-neural-network-based code clone detection system, the system comprising: the device comprises a data acquisition module, a data detection module and an output module;
the data acquisition module is used for acquiring code data to be detected and inputting the acquired code data to be detected into the data detection module to obtain a detection result;
the data detection module comprises a preprocessing unit, a feature extraction unit and a similarity calculation unit;
the preprocessing unit comprises a formatting component and a dividing component; the formatting component is used for converting the data format of the code data to be detected sent by the data acquisition module to obtain data in a specific format; the dividing component is used for dividing a data set with a specific format to obtain a training set, a verification set and a test set;
the feature extraction unit acquires the data in the training set, the verification set and the test set, and converts the data in each set into matrices;
the similarity calculation unit is used for acquiring the matrix pairs generated by the feature extraction unit, calculating the similarity of each matrix pair and classifying the code data to be detected according to the similarity;
and the output module is used for acquiring the classification result of the similarity calculation unit and outputting the classification result.
Preferably, in the specific format, each piece of data in the formatted data set comprises a pair of byte code instruction sequences and a class label.
The invention has the beneficial effects that:
1. compared with the existing detection method based on texts and lexical methods, the method has the characteristic of fully considering code semantic information, and can improve the detection effect on type 3 and type 4 clones in the aspects of accuracy, recall rate, F1 measurement value and the like.
2. The method provided by the invention characterizes the byte code instruction sequence in a word vector matrix form, and has the characteristic of low detection time consumption compared with the existing detection method based on grammar and semantics.
3. The method provided by the invention combines a neural network technology, can learn the potential rule of code cloning from the code cloning data set, and has a gain effect compared with the traditional detection method based on byte codes.
Drawings
FIG. 1 is a schematic illustration of an embodiment of the present invention;
FIG. 2 is a block diagram of a data detection module according to an embodiment of the present invention;
FIG. 3 is a diagram of a preprocessing unit according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the formatting of a data set according to an embodiment of the present invention;
FIG. 5 is a diagram of a feature extraction unit according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating an embodiment of a word vector corpus extraction process;
FIG. 7 is a diagram illustrating word vector mapping according to an embodiment of the present invention;
FIG. 8 is a diagram of a similarity calculation unit according to an embodiment of the present invention;
FIG. 9 is a diagram of a detection network model structure according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A code clone detection method based on byte codes and a neural network comprises the following steps: and acquiring code data to be detected, inputting the code data to be detected into the trained code clone detection model to obtain a detection result, and marking and storing the detection result.
The process of training the code clone detection model comprises the following steps:
s1: acquiring an original code data set; preprocessing the code data to obtain a formatted data set;
s2: dividing the formatted data set to obtain a training set, a verification set and a test set;
s3: extracting features of the data in the training set to obtain features of the code data, and collecting the features of the code data to obtain a matrix pair set;
s4: inputting the matrix pair set into a twin neural network based on LSTM to obtain a detection result;
s5: calculating a loss function value of the model according to the calculated result, inputting the validation set data into the model, continuously adjusting the parameters of the model, and finishing the training when the loss function reaches its minimum;
s6: and inputting the test set into a trained code clone detection model, and evaluating the model.
The process of preprocessing the code data comprises the following steps: cleaning original code data and deleting redundant data; carrying out format conversion on the cleaned data to obtain a formatted data set; each piece of data in the formatted data set comprises a pair of bytecode instruction sequences and a class label.
The process of format conversion of the cleaned data comprises the following steps:
step 1: the source code is converted to a class file.
Step 2: the class file is converted into a bytecode instruction file.
Step 3: mapping the functions in the byte code instruction file one-to-one to the functions in the source code according to the relative path of the file in which the function is located, the function name, the return value type, the parameter list and the modifier list, wherein a function in the byte code instruction file is embodied as a byte code instruction sequence.
Step 4: generating the clone label of the function pair in the byte code instruction file from the clone label of the source code function pair to complete the format conversion.
The process of extracting the features of the data in the training set comprises the following steps:
step 1: a byte code word vector corpus is established. As shown in fig. 6, the construction method includes: selecting Java source code; secondly, compiling the source code to obtain a corresponding class file library; then, converting the class file library into a byte code instruction file library by using a javap command; and finally, extracting the operation code sequences of all functions in each bytecode instruction file as a word vector training corpus, wherein the operation code sequence of each function is used as a piece of training data.
Step 2: and (3) constructing a custom word vector model, and inputting the byte code word vector corpus into the custom word vector model for training to obtain a trained word vector model. The method comprises the following specific steps: firstly, constructing a CBOW model of word2 vec; then, inputting a word vector training corpus to train; and finally, obtaining a trained word vector model, wherein the word vector model records the mapping relation between the operation codes and the word vectors. The key parameter setting condition is as follows: training batches: 1000, vector shape: 1 × 100, sliding window size: 10.
and step 3: and inputting the byte code instruction sequence pair set into the trained word vector model to obtain a matrix pair set.
The process of converting the byte code instruction sequence into the matrix comprises the following steps:
step 1: a Tokenizer is used to convert all the opcodes in the dataset into an opcode dictionary, with opcode subscripts in the dictionary counting from 1.
Step 2: traversing the operation code dictionary by using the trained word vector model, and converting the operation code dictionary into a two-dimensional word vector matrix, wherein the operation code with index i in the dictionary corresponds to the word vector in the (i-1) th row in the word vector matrix.
And step 3: the word vector matrix is used as a parameter to generate an Embedding layer, the Embedding layer can convert positive integers (operation code subscripts) into word vectors corresponding to operation codes, and the positive integer sequence forms a matrix with fixed dimensionality.
And 4, step 4: converting the byte code instruction sequence into a digital sequence through a Tokenizer, wherein the meaning represented by the digital sequence is the same as the subscript of the operation code in the operation code dictionary;
and 5: and inputting the subscript sequence of the operation code into an Embedding layer to generate a matrix.
A code clone detection system based on byte code and a neural network, as shown in fig. 1, comprises: a data acquisition module, a data detection module and an output module. The data acquisition module is used for acquiring code data to be detected and inputting it into the data detection module to obtain a detection result; the data detection module is used for detecting and classifying the code data to be detected to obtain the detection result; and the output module is used for acquiring the classification result of the data detection module and outputting it.
An embodiment of a data detection module, as shown in fig. 2, includes: the device comprises a preprocessing unit, a feature extraction unit and a similarity calculation unit;
the preprocessing unit is used for extracting the byte code instruction sequence pairs in the data set and dividing the data set, facilitating the data processing of subsequent modules. As shown in fig. 3, the preprocessing unit includes a formatting component and a partitioning component.
The formatting component can convert the provided bytecode clone detection data set into a data set conforming to a specific format required by the detection system, as shown in fig. 4, each piece of data of the formatted data set should include a bytecode instruction sequence pair and a classification tag.
The partitioning component randomly divides the data set in a 4:1:1 ratio into a training set, a verification set and a test set, which are used respectively for training the model, tuning the hyper-parameters and obtaining the evaluation indexes of the model.
The feature extraction unit converts the byte code instruction sequences in the data set, as well as the byte code instruction sequences under inspection, into matrices of specified dimensionality, so that they can be processed by the neural network model in the similarity calculation unit.
As shown in fig. 5, the feature extraction unit includes a word vector generation component and a sequence conversion component.
The word vector generating component is used for constructing a word vector training model; the method comprises the following specific steps:
step 1: a byte code word vector corpus is established. The method comprises the following steps: firstly, selecting Java source code; secondly, compiling the source code to obtain a corresponding class file library; then, converting the class file library into a byte code instruction file library by using a javap command; and finally, extracting the operation code sequences of all functions in each bytecode instruction file as a word vector training corpus, wherein the operation code sequence of each function is used as a piece of training data.
Step 2: a word vector training model is constructed. The method comprises: first, constructing a word2vec CBOW model; then, training it on the word vector corpus; and finally, obtaining the trained word vector model. The key parameters are set as follows: training iterations: 1000; vector shape: 1 × 100; sliding window size: 10.
the sequence conversion component converts the byte code instruction sequences in the data set, as well as those under inspection, into matrices according to the word vector training model, so that they can be processed by the neural network model in the similarity calculation unit. As shown in fig. 7, this component converts an opcode sequence into a matrix.
The similarity calculation unit obtains a trained model by constructing a twin neural network model and training it on the data generated by the feature extraction unit. The similarity of a byte code instruction sequence pair under inspection is calculated by the model, and its clone classification is obtained according to the classification threshold determined during training.
As shown in fig. 8, the similarity calculation unit includes a neural network model, a training component, an evaluation component, and a detection component.
The neural network model is used for calculating the similarity condition of the byte code instruction sequence pair.
As shown in fig. 9, the neural network model is an LSTM-based twin neural network composed of an input layer, a backbone network and a classification network. The input layer is an Embedding layer whose key parameters are output_dim = 100 and input_length = 300, where output_dim corresponds to the word vector dimensionality and input_length corresponds to the maximum operation code sequence length in the training set. The Embedding layer realizes the mapping of the sequence conversion component. The backbone network uses two weight-sharing long short-term memory networks (LSTMs) as the sub-networks of the twin neural network to process the Embedding layer output; the key LSTM parameters are units = 175, dropout = 0.15 and recurrent_dropout = 0.15. The classification network is formed by alternating two Dropout layers, two BatchNormalization layers and two Dense layers, with binary cross-entropy set as the loss function. The rate of the two Dropout layers is 15%, the first Dense layer has 175 neurons, and the second Dense layer has 1 neuron.
The training component can obtain a detection model with excellent evaluation indexes through continuous iterative learning. The specific process of training comprises:
Step 1: training. The model hyper-parameters are set, including the number of iterations, batch size, learning rate, etc.; the model is trained with the training set produced by the partitioning component in the preprocessing unit, and the model parameters are learned by means of forward propagation, backward propagation and the loss function. The specific training process comprises:
1. The training data are input into the Embedding layer to generate two sets of 300 × 100 dimensional matrices.
2. The two matrix sets are fed into the weight-sharing LSTM modules respectively, giving two sets of 175-dimensional vectors.
3. The two vector sets are concatenated to obtain one set of 300-dimensional vectors.
4. The vector set passes through a Dropout layer, a BatchNormalization layer and a Dense layer in sequence for parameter training, yielding a 100-dimensional vector set.
5. The vector set then passes through a Dropout layer, a BatchNormalization layer and a Dense layer in sequence to obtain the final predicted values; the predicted and actual values are substituted into the loss function to obtain the loss value.
6. Backward propagation is performed according to the loss value and the predicted values, and the parameters of each layer are updated.
Step 2: verification. The evaluation indexes are obtained on the divided verification set: precision, recall and F1 measure. The precision is calculated as:

$$P = \frac{TP}{TP + FP}$$

the recall is calculated as:

$$R = \frac{TP}{TP + FN}$$

and the F1 measure is calculated as:

$$F1 = \frac{2 \times P \times R}{P + R}$$
wherein TP represents a true positive case, FP represents a false positive case, and FN represents a false negative case.
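A direct implementation of the three evaluation formulas (a sketch; variable names are illustrative):

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported clone pairs that really are clones."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of actual clone pairs that were detected."""
    return tp / (tp + fn)

def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# e.g. 80 true positives, 10 false positives, 20 false negatives
print(precision(80, 10), recall(80, 20), f1(80, 10, 20))
```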
Step 3: steps 1 and 2 are repeated, and the model parameters with the best evaluation indexes are selected, wherein the parameters of the model include parameters and hyper-parameters.
The evaluation component is configured to evaluate the generalization capability of the model: the trained model is evaluated with the test set divided by the preceding module to obtain the evaluation indexes: precision, recall, F1 measure, and the like.
The detection component calculates the similarity of the byte code instruction sequence pair to be detected with the trained model; the similarity is the neural network's predicted value for the input byte code instruction sequence pair, in the interval [0, 1]. According to the threshold of 0.5 determined during training, the clone classification of the byte code instruction sequence pair under inspection is obtained: if the similarity score falls within the interval (0.5, 1], the pair is a clone pair; if it falls within [0, 0.5], it is a non-clone pair.
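The thresholding rule can be written directly; the function name is illustrative:

```python
def classify(similarity: float, threshold: float = 0.5) -> str:
    """Map a predicted similarity in [0, 1] to a clone label.
    Scores in (threshold, 1] are clones; scores in [0, threshold] are not."""
    return "clone" if similarity > threshold else "non-clone"

assert classify(0.93) == "clone"
assert classify(0.5) == "non-clone"  # the boundary falls in the non-clone interval
assert classify(0.07) == "non-clone"
```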
Step 1: extracting the operation codes. The operation codes in the byte code instruction sequence pair under inspection are extracted to form an operation code sequence pair.
Step 2: converting to word vectors. The operation code sequence pair is converted into a matrix pair using the word vector training model in the feature extraction unit.
Step 3: obtaining the clone classification. The matrix pair is input into the trained twin neural network model to obtain the similarity value of the operation code sequence pair, and its clone classification is obtained according to the classification threshold; this classification is the classification of the byte code instruction sequence pair to be detected.
A code clone detection system based on byte code and a neural network further comprises a transcoding module, which converts all functions in the files and projects under inspection into a byte code instruction sequence set. The transcoding process comprises:
Step 1: unifying the encoding categories. Java source code files among the files under inspection are converted into class files; class files are left unchanged.
Step 2: converting the file format. The class files are converted into byte code instruction files using the javap command.
Step 3: extracting the byte code instruction sequences. All byte code instruction sequences in each byte code instruction file are extracted, each byte code instruction sequence representing one function.
Furthermore, the content under inspection may be a class file pair, a Java source code file pair, or a combination of a class file and a Java source code file. To handle code clones across projects of the same or different formats, the invention uses the transcoding module to transcode the content under inspection: the module converts these three detection types into byte code instruction sequence sets, and the sequences in the sets are compared pairwise to obtain multiple byte code instruction sequence pairs. The obtained pairs are then input into the detection component to obtain the clone distribution of the functions in the file pair under inspection. For example, if the numbers of functions in a file pair are 10 and 12 respectively, then 10 × 12 = 120 byte code instruction sequence pairs are obtained, and 120 clone classification results can be produced.
Furthermore, according to specific detection requirements, the content under inspection may also be a single class file, a single Java source code file, or multiple class files and Java source code files, for detecting code clones inside a project. These detection types are converted into a byte code instruction sequence set by the transcoding module, each byte code instruction sequence representing one function; pairwise comparison yields multiple byte code instruction sequence pairs, and the detection component gives the distribution of function clones in the files under inspection. For example, if the number of functions in the files under inspection is 10, then 10 byte code instruction sequences are obtained after transcoding, and pairwise comparison yields

$$\binom{10}{2} = \frac{10 \times 9}{2} = 45$$

byte code instruction sequence pairs; detecting these 45 sequence pairs finally gives the function cloning result for the files under inspection.
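Both pair counts used in the examples above follow from elementary combinatorics: cross-file comparison of m and n functions yields m × n pairs, while comparison within one set of n functions yields n choose 2 pairs:

```python
from math import comb

def cross_project_pairs(m: int, n: int) -> int:
    """Every function of one file is compared with every function of the other."""
    return m * n

def intra_file_pairs(n: int) -> int:
    """Pairwise comparison inside one set of functions: n choose 2."""
    return comb(n, 2)

assert cross_project_pairs(10, 12) == 120  # the 10-vs-12 example
assert intra_file_pairs(10) == 45          # the 10-function example
```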
The above-mentioned embodiments, which further illustrate the objects, technical solutions and advantages of the present invention, should be understood that the above-mentioned embodiments are only preferred embodiments of the present invention, and should not be construed as limiting the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A code clone detection method based on byte codes and a neural network is characterized by comprising the following steps: acquiring code data to be detected, inputting the code data to be detected into a trained code clone detection model to obtain a detection result, and marking and storing the detection result;
the process of training the code clone detection model comprises the following steps:
s1: acquiring an original code data set; preprocessing the code data to obtain a formatted data set;
s2: dividing the formatted data set to obtain a training set, a verification set and a test set;
s3: extracting features of the data in the training set to obtain features of the code data, and collecting the features of the code data to obtain a matrix pair set;
s4: inputting the matrix pair set into a neural network to obtain a detection result;
s5: calculating a loss function value of the model according to the detection result, inputting the data in the verification set into the model, and continuously adjusting the parameters of the model; the training of the model is completed when the loss function is minimized;
s6: inputting the test set into the trained code clone detection model, and evaluating the model.
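As a minimal sketch of step S2, the formatted data set can be shuffled and divided into training, verification, and test sets. The 8:1:1 ratio, the fixed seed, and the record shape below are illustrative assumptions, not values specified in the claim.

```python
import random

def split_dataset(data, train=0.8, val=0.1, seed=42):
    """Shuffle and split formatted data into train / verification / test sets."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * train)
    n_val = int(n * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# each record: (bytecode instruction sequence pair, clone label) -- hypothetical shapes
records = [((f"seq_a{i}", f"seq_b{i}"), i % 2) for i in range(100)]
train_set, val_set, test_set = split_dataset(records)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```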
2. The method for detecting code clone based on byte code and neural network as claimed in claim 1, wherein the process of preprocessing the code data comprises: cleaning original code data and deleting redundant data; carrying out format conversion on the cleaned data to obtain a formatted data set; each piece of data in the formatted data set comprises a pair of bytecode instruction sequences and a class label.
3. The method of claim 1, wherein the step of extracting the features of the data in the training set comprises:
step 1: establishing a byte code word vector corpus;
step 2: constructing a custom word vector model, inputting a byte code word vector corpus into the custom word vector model for training to obtain a trained word vector model;
and step 3: inputting the byte code instruction sequence pair into the trained word vector model to obtain a matrix pair set.
4. The method according to claim 3, wherein the process of creating a byte-code word vector corpus comprises:
step 11: selecting a code base;
step 12: compiling the selected code library to obtain a class file library;
step 13: converting the class file library into a byte code instruction file library;
step 14: extracting the operation code sequences of all functions in each bytecode instruction file as a word vector training corpus, wherein the operation code sequence of each function is used as a piece of training data.
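Step 14 can be sketched by scanning a disassembled bytecode instruction file and keeping one opcode sequence per function. The `javap -c`-style input format and the regular expression below are illustrative assumptions; the actual transcoding module would parse its own disassembler's output.

```python
import re

# Matches one numbered bytecode instruction line, e.g. "   3: iload_1"
INSTR = re.compile(r"^\s*\d+:\s+([a-z_0-9]+)")

def extract_opcode_sequences(disassembly):
    """Return one opcode list per function from a javap-like dump."""
    sequences, current = [], None
    for line in disassembly.splitlines():
        if line.strip().endswith("Code:"):   # start of a method body
            current = []
            sequences.append(current)
        else:
            m = INSTR.match(line)
            if m and current is not None:
                current.append(m.group(1))
    return sequences

dump = """\
  public int add(int, int);
    Code:
       0: iload_1
       1: iload_2
       2: iadd
       3: ireturn
"""
print(extract_opcode_sequences(dump))  # [['iload_1', 'iload_2', 'iadd', 'ireturn']]
```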
5. The method of claim 3, wherein the training of the custom word vector model comprises:
step 21: constructing a word vector model;
step 22: setting key parameters of the word vector model; the key parameters comprise the number of training epochs, the vector dimension, and the sliding window size;
step 23: training the word vector model according to the set key parameters and the byte code word vector corpus to obtain the trained word vector model.
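The sliding window parameter in claim 5 determines which opcodes count as context for each center opcode. The sketch below, assuming a skip-gram-style word vector model, enumerates the (center, context) training pairs such a window produces; an off-the-shelf word vector implementation would derive these implicitly during training.

```python
def window_pairs(sequence, window=2):
    """Yield (center, context) opcode pairs within a sliding window."""
    pairs = []
    for i, center in enumerate(sequence):
        lo, hi = max(0, i - window), min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs

opcodes = ["iload_1", "iload_2", "iadd", "ireturn"]
pairs = window_pairs(opcodes, window=1)
print(pairs)
# [('iload_1', 'iload_2'), ('iload_2', 'iload_1'), ('iload_2', 'iadd'),
#  ('iadd', 'iload_2'), ('iadd', 'ireturn'), ('ireturn', 'iadd')]
```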
6. The method of claim 1, wherein the processing of the matrix pair set using the neural network comprises: inputting each matrix pair in the matrix pair set into the neural network and obtaining a predicted value through forward propagation.
7. The method of claim 1, wherein the loss function of the model is expressed as:

L = -(1/N) · Σ_{i=1}^{N} [ y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i) ]

wherein N represents the total amount of data, y_i represents the actual value of the i-th piece of data, and ŷ_i represents the predicted value of the i-th piece of data.
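Assuming the loss in claim 7 is the standard binary cross-entropy over clone labels (a common choice for two-class clone detection, and consistent with the y_i / ŷ_i definitions given), it can be computed as follows; the label and prediction values are illustrative.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over N (label, prediction) pairs."""
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)   # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n

labels = [1, 0, 1, 0]          # actual clone labels y_i
preds  = [0.9, 0.1, 0.8, 0.2]  # predicted similarities ŷ_i
loss = binary_cross_entropy(labels, preds)
print(round(loss, 4))  # 0.1643
```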
8. A code clone detection system based on byte code and a neural network, characterized by comprising: a data acquisition module, a data detection module, and an output module;
the data acquisition module is used for acquiring code data to be detected and inputting the acquired code data to be detected into the data detection module to obtain a detection result;
the data detection module comprises a preprocessing unit, a feature extraction unit and a similarity calculation unit;
the preprocessing unit comprises a formatting component and a dividing component; the formatting component is used for carrying out data format conversion on the code data in the acquired original data set to obtain data in a specific format; the dividing component is used for dividing a data set with a specific format to obtain a training set, a verification set and a test set;
the characteristic extraction unit acquires data in a training set, a verification set and a test set and converts the data in each set into a matrix pair set;
the similarity calculation unit is used for acquiring the matrix pairs of the feature extraction unit, predicting the clone classification of each matrix pair, wherein the predicted value represents the similarity, and classifying the code data according to the similarity;
and the output module is used for acquiring the classification result of the similarity calculation unit and outputting the classification result.
9. The system of claim 8, wherein in the data of the specific format, each piece of data in the formatted data set comprises a pair of bytecode instruction sequences and a class label.
CN202111400977.9A 2021-11-19 2021-11-19 Code clone detection method and system based on byte code and neural network Pending CN114064117A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111400977.9A CN114064117A (en) 2021-11-19 2021-11-19 Code clone detection method and system based on byte code and neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111400977.9A CN114064117A (en) 2021-11-19 2021-11-19 Code clone detection method and system based on byte code and neural network

Publications (1)

Publication Number Publication Date
CN114064117A true CN114064117A (en) 2022-02-18

Family

ID=80276784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111400977.9A Pending CN114064117A (en) 2021-11-19 2021-11-19 Code clone detection method and system based on byte code and neural network

Country Status (1)

Country Link
CN (1) CN114064117A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114780103A (en) * 2022-04-26 2022-07-22 中国人民解放军国防科技大学 Semantic code clone detection method based on graph matching network
CN114780103B (en) * 2022-04-26 2022-12-20 中国人民解放军国防科技大学 Semantic code clone detection method based on graph matching network
CN115373737A (en) * 2022-07-06 2022-11-22 云南恒于科技有限公司 Code clone detection method based on feature fusion
WO2024099069A1 (en) * 2022-11-09 2024-05-16 Huawei Technologies Co., Ltd. Systems, methods, and non-transitory computer-readable storage devices for detecting data clones in tabular datasets
CN115906104A (en) * 2023-02-23 2023-04-04 国网山东省电力公司泰安供电公司 Safety detection method and device for secondary packaged open-source assembly

Similar Documents

Publication Publication Date Title
CN111428044B (en) Method, device, equipment and storage medium for acquiring supervision and identification results in multiple modes
CN111310438B (en) Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model
CN114064117A (en) Code clone detection method and system based on byte code and neural network
CN108446540B (en) Program code plagiarism type detection method and system based on source code multi-label graph neural network
CN111680494B (en) Similar text generation method and device
CN112215013B (en) Clone code semantic detection method based on deep learning
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN113919319B (en) Script event prediction method based on action scene reinforcement
CN111401081A (en) Neural network machine translation method, model and model forming method
CN112131888A (en) Method, device and equipment for analyzing semantic emotion and storage medium
CN115438709A (en) Code similarity detection method based on code attribute graph
CN115374845A (en) Commodity information reasoning method and device
CN115688784A (en) Chinese named entity recognition method fusing character and word characteristics
CN112035099A (en) Vectorization representation method and device for nodes in abstract syntax tree
CN114742069A (en) Code similarity detection method and device
Eppa et al. Source code plagiarism detection: A machine intelligence approach
CN116757218A (en) Short text event coreference resolution method based on sentence relation prediction
CN115840815A (en) Automatic abstract generation method based on pointer key information
Zhou et al. An intelligent model validation method based on ECOC SVM
CN116301875A (en) Code semantic redundancy metric verification method based on triggerability of learning model
CN111562943B (en) Code clone detection method and device based on event embedded tree and GAT network
CN111581339B (en) Method for extracting gene events of biomedical literature based on tree-shaped LSTM
CN113988083A (en) Factual information coding and evaluating method for shipping news abstract generation
CN117435246B (en) Code clone detection method based on Markov chain model
Wu et al. Goner: Building Tree-Based N-Gram-Like Model for Semantic Code Clone Detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination