CN110427464B - Code vector generation method and related device

Info

Publication number: CN110427464B (application CN201910747430.2A; other versions: CN110427464A)
Authority: CN (China)
Legal status: Active (granted)
Prior art keywords: vector, sequence, code, output, word
Inventors: 赵旸, 刘思凡, 邱旻峰
Assignee: Tencent Technology Shenzhen Co Ltd

Classifications

    • G06F16/3344 Query execution using natural language analysis
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Stored Programmes (AREA)

Abstract

Embodiments of the present application provide a code vector generation method and a related device. A code text is converted into word vectors, and the word vectors are input into a neural network model in both forward and reverse order to obtain output vectors, so that the sequential relationships among the word vectors are expressed in the output vectors. A code vector is then determined from the output vectors and a weight vector representing importance, so that the output vectors are weighted according to the importance of the corresponding parts of the code text. The code vector generated by the embodiments of the present application not only represents the sequential relationships of the code in the code text but also gives greater weight to the important parts of the code, so a better analysis effect can be obtained by analyzing this code vector.

Description

Code vector generation method and related device
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method for generating a code vector and a related device.
Background
Using deep learning for code analysis tasks such as error detection, code generation, and code completion is becoming an industry hotspot. After a server acquires a code segment, the code segment is converted into a vector, and error detection is then performed on the vector through deep learning.
At present, only an open-source word vector tool (the word2vec tool) is used to convert the code segment into word vectors, a code vector is generated by simply summing and averaging those word vectors, and the code vector is then analyzed.
This analysis method can neither capture the sequential relationships among words nor place extra emphasis on the important key parts of the code, so its analysis effect is poor.
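For context only, the prior-art baseline described above can be sketched as follows; numpy and the random vectors are illustrative assumptions and are not part of any embodiment.

```python
# Sketch of the prior-art baseline: word vectors are simply summed and
# averaged into a code vector, which loses both word order and the
# relative importance of tokens. numpy and the vectors are assumptions.
import numpy as np

rng = np.random.default_rng(0)
word_vectors = [rng.normal(size=100) for _ in range(12)]  # one per token
naive_code_vector = np.mean(word_vectors, axis=0)         # order-insensitive
```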
Disclosure of Invention
The embodiment of the application provides a code vector generation method and a related device, which are used for solving the technical problem of poor analysis effect of the existing code analysis method.
In view of this, a first aspect of an embodiment of the present application provides a method for generating a code vector, including:
obtaining a first word vector sequence and a second word vector sequence corresponding to a code text, wherein the first word vector sequence is formed by arranging a first word vector to an N-th word vector in order, the second word vector sequence is formed by arranging the N-th word vector to the first word vector in order, and N is an integer greater than 1;
obtaining an output vector sequence through a bidirectional long short-term memory (LSTM) network model, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used for generating a first output sequence corresponding to the first word vector sequence, the reverse LSTM network model is used for generating a second output sequence corresponding to the second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is formed by arranging output vectors;
calculating the score corresponding to each output vector according to the output vector sequence and a weight vector;
and generating a code vector corresponding to the output vector sequence according to the scores corresponding to the output vectors.
A second aspect of an embodiment of the present application provides a method for code analysis, including:
obtaining a first word vector sequence and a second word vector sequence corresponding to a code text, wherein the first word vector sequence is formed by arranging a first word vector to an N-th word vector in order, the second word vector sequence is formed by arranging the N-th word vector to the first word vector in order, and N is an integer greater than 1;
obtaining an output vector sequence through a bidirectional long short-term memory (LSTM) network model, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used for generating a first output sequence corresponding to the first word vector sequence, the reverse LSTM network model is used for generating a second output sequence corresponding to the second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is formed by arranging output vectors;
calculating the score corresponding to each output vector according to the output vector sequence and a weight vector;
generating a code vector corresponding to the output vector sequence according to the scores corresponding to the output vectors;
performing code analysis according to the code vector to generate a code analysis result;
and sending the code analysis result to a terminal device, so that the terminal device displays the code analysis result.
A third aspect of an embodiment of the present application provides an apparatus for generating a code vector, including:
an obtaining unit, configured to obtain a first word vector sequence and a second word vector sequence corresponding to a code text, wherein the first word vector sequence is formed by arranging a first word vector to an N-th word vector in order, the second word vector sequence is formed by arranging the N-th word vector to the first word vector in order, and N is an integer greater than 1;
a processing unit, configured to obtain an output vector sequence through a bidirectional long short-term memory (LSTM) network model, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used for generating a first output sequence corresponding to the first word vector sequence, the reverse LSTM network model is used for generating a second output sequence corresponding to the second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is formed by arranging output vectors;
the processing unit is further configured to calculate the score corresponding to each output vector according to the output vector sequence and a weight vector;
and a generating unit, configured to generate a code vector corresponding to the output vector sequence according to the scores corresponding to the output vectors.
In one possible design, in an implementation manner of the third aspect of the embodiment of the present application, the obtaining unit is further configured to:
acquiring the code text;
converting the code text into a token sequence, wherein the token sequence is formed by converting each word or symbol in the code text;
generating the N word vectors through a word vector tool according to the token sequence to obtain the first word vector sequence;
and arranging the first word vector sequence in reverse order to obtain the second word vector sequence.
In one possible design, in an implementation manner of the third aspect of the embodiment of the present application, the apparatus further includes:
and a sending unit, configured to send the code vector to a terminal device, so that the terminal device displays the code vector.
In one possible design, in one implementation of the third aspect of the embodiments of the present application,
the generating unit is further configured to: perform code analysis according to the code vector to generate a code analysis result;
the sending unit is further configured to: send the code analysis result to a terminal device, so that the terminal device displays the code analysis result.
In one possible design, in an implementation manner of the third aspect of the embodiments of the present application, the processing unit is further configured to:
determining a weight score corresponding to the output vector according to the output vector and the weight vector;
determining a total weight score corresponding to the output vector sequence according to the output vector sequence and the weight score corresponding to each output vector;
and determining the score of the output vector according to the weight score and the total weight score.
In one possible design, in an implementation manner of the third aspect of the embodiment of the present application, the generating unit is further configured to:
obtaining block vectors by multiplying the scores by the output vectors;
and splicing the block vectors to form the code vector.
A fourth aspect of an embodiment of the present application provides a server, including: memory, transceiver, processor, and bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory, and comprises the following steps:
obtaining a first word vector sequence and a second word vector sequence corresponding to a code text, wherein the first word vector sequence is formed by arranging a first word vector to an N-th word vector in order, the second word vector sequence is formed by arranging the N-th word vector to the first word vector in order, and N is an integer greater than 1;
obtaining an output vector sequence through a bidirectional long short-term memory (LSTM) network model, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used for generating a first output sequence corresponding to the first word vector sequence, the reverse LSTM network model is used for generating a second output sequence corresponding to the second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is formed by arranging output vectors;
calculating the score corresponding to each output vector according to the output vector sequence and a weight vector;
generating a code vector corresponding to the output vector sequence according to the scores corresponding to the output vectors;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
The processor is used for executing the program in the memory, and comprises the following steps:
acquiring the code text;
converting the code text into a token sequence, wherein the token sequence is formed by converting each word or symbol in the code text;
generating the N word vectors through a word vector tool according to the token sequence to obtain the first word vector sequence;
and arranging the first word vector sequence in reverse order to obtain the second word vector sequence.
The processor is used for executing the program in the memory, and comprises the following steps:
and sending the code vector to a terminal device, so that the terminal device displays the code vector.
The processor is used for executing the program in the memory, and comprises the following steps:
performing code analysis according to the code vector to generate a code analysis result;
and sending the code analysis result to a terminal device, so that the terminal device displays the code analysis result.
The processor is used for executing the program in the memory, and comprises the following steps:
determining a weight score corresponding to the output vector according to the output vector and the weight vector;
determining a total weight score corresponding to the output vector sequence according to the output vector sequence and the weight score corresponding to each output vector;
and determining the score of the output vector according to the weight score and the total weight score.
The processor is used for executing the program in the memory, and comprises the following steps:
obtaining block vectors by multiplying the scores by the output vectors;
and splicing the block vectors to form the code vector.
A fifth aspect of an embodiment of the application provides a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of the first or second aspect.
A sixth aspect of an embodiment of the application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first or second aspect.
From the above technical solutions, the embodiment of the present application has the following advantages:
In the embodiments of the present application, the code text is converted into word vectors, and the word vectors are input into the neural network model in both forward and reverse order to obtain output vectors, so that the sequential relationships among the word vectors are expressed in the output vectors; the code vector is then determined according to the output vectors and a weight vector representing importance, so that the output vectors are weighted according to the importance of the corresponding parts of the code text. The code vector generated by the embodiments of the present application not only represents the sequential relationships of the code in the code text but also gives greater weight to the important parts of the code, so a better analysis effect can be obtained by analyzing this code vector.
Drawings
FIG. 1 is a block diagram of a developer platform according to an embodiment of the present application;
FIG. 2 is a representation of highlighting error codes in an embodiment of the present application;
FIG. 3 is a program display interface on a terminal device of a software developer;
FIG. 4 is an interface diagram for a manager to log into a developer platform to view;
FIG. 5 is a flowchart of a method for generating a code vector according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a forward LSTM network model in accordance with an embodiment of the present application;
FIG. 7 is a schematic diagram of a bidirectional LSTM network model in accordance with an embodiment of the present application;
FIG. 8 is a schematic diagram of the internal structure of a neuron in the bidirectional LSTM network model;
FIG. 9 is a schematic diagram of a reverse LSTM network model in accordance with an embodiment of the application;
FIG. 10 is a diagram illustrating a process of converting an output vector into a code vector according to an embodiment of the present application;
FIG. 11 is a diagram illustrating the generation of code vectors from code segments according to an embodiment of the present application;
FIG. 12 is a diagram illustrating the conversion of a C language code segment into code vectors according to an embodiment of the present application;
FIG. 13 is a diagram illustrating the conversion of code text into a markup sequence according to an embodiment of the present application;
FIG. 14 is a diagram showing code vectors according to an embodiment of the present application;
FIG. 15 is a schematic diagram of an administrator looking at code vectors in an embodiment of the present application;
FIG. 16 is a schematic diagram of code analysis in an embodiment of the application;
FIG. 17 is a flowchart of an application example provided in an embodiment of the present application;
FIG. 18 is a schematic diagram of an apparatus for generating code vectors according to an embodiment of the present application;
FIG. 19 is a schematic diagram of an alternative embodiment of an apparatus for code vector generation in accordance with an embodiment of the present application;
fig. 20 is a schematic diagram of a server structure according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a code vector generation method and a related device, which are used for solving the technical problem of poor analysis effect of the existing code analysis method.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "includes" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be appreciated that after a software developer writes program code, the running program code is tested. If an error (bug) occurs in the code, the application program cannot run correctly and the product cannot go online; at this point, the software developer needs to perform a deep analysis and inspection of the code. Conventional code inspection relies on manual work, which has the drawbacks of high labor cost and long inspection time, and it cannot guarantee that erroneous code is found.
In view of this, an embodiment of the present application provides a developer platform for inspecting and analyzing code. Fig. 1 is a schematic diagram of a developer platform according to an embodiment of the present application. After the software developer writes code, the code can be sent to the server of the developer platform, which analyzes and checks the code; the code can be released online after it is confirmed. After the software developer finishes the program code on a terminal device, the program code can be uploaded to the server. The server detects and analyzes the program code and returns the analysis result to the terminal device for display; for example, the server identifies that a certain code segment is a suspected error code segment and sends the position of that segment within the program code to the terminal device, so that the terminal device displays the code segment, or highlights the suspected error code segment within the program code.
Fig. 2 is a display diagram of highlighting an error code in an embodiment of the present application. In the display interface of the terminal device showing the program code, the boxed portion is the highlighted portion. After the software developer observes the highlighted portion of the code, the code can be modified for that portion without paying attention to the non-highlighted portions, which helps improve the software developer's efficiency.
In the embodiment of the present application, the program code sent by the software developer to the developer platform may be the code of a complete program, such as a complete mobile phone application (APP), complete computer software, or a complete service framework; it may also be applet code or functional-module code embedded within an application program or an operating system kernel; it may also be front-end code of a web page, back-end code of a web page, or a code segment to be analyzed. In practical applications it may be other code as well, which is not limited herein.
In the embodiment of the present application, the program code is sent by the software developer to the developer platform in the form of a data packet or a file; in practical applications it may also be sent in encrypted form, which is not limited in this specification.
In the embodiment of the present application, the software developer can send the program code to the developer platform after the program code is written, or can send it at preset intervals while the program code is being written, obtaining real-time feedback from the developer platform and highlighting erroneous program code in real time. In practical applications, a "check" virtual button may also be provided on the software interface used by the software developer to write code; when the software developer clicks the "check" virtual button, the terminal device sends the current program code to the server for checking.
Fig. 3 shows a program display interface on a terminal device of a software developer. The programming interface has a title bar, a function panel, and a main interface, and the software developer writes the program through the main interface. When the software developer wants to check for errors in the program code, the "check" virtual button in the function panel can be clicked to trigger a check instruction; the terminal device then sends the program code to the server for checking according to the check instruction, and highlights the erroneous code segment of the program code in the main interface according to the check result returned by the server.
It will be appreciated that terminal devices include, but are not limited to, mobile phones, desktop computers, tablet computers, notebook computers, and palmtop computers.
In the embodiment of the present application, code analysis may be code retrieval, code classification, code marking, code error correction, and the like. The foregoing description takes code error correction as an example; in practical applications, other code analysis methods and other ways of displaying code analysis results may also be used. For example, after the developer platform receives the program code sent by the software developer, code analysis is performed to obtain the categories of different code segments, and the different code segments may then be sent to the terminal device in different colors so that the terminal device displays them in different colors, for example the main program in red and embedded functions in blue, which is not limited herein.
The manager of the developer platform can log in to the developer platform to view the program code uploaded by the terminal devices, including the erroneous code segments of the program code. Fig. 4 is an interface diagram of a manager logging in to the developer platform for viewing. The interface displayed on the developer platform may have a title bar, a function panel, and a main interface, and the main interface may display a terminal device identifier, the code language type, and the program code. It can be understood that the developer platform on the server can receive program code uploaded by a plurality of terminal devices, analyze the program code of each of them, and send the corresponding analysis results back to the terminal devices.
In the embodiment of the present application, the developer platform can perform code analysis on multiple programming languages; for example, the programming language of terminal device 1 is php, that of terminal device 2 is C, and that of terminal device 3 is Java. In practical applications, the developer platform can also process other programming languages such as C++, which is not limited herein.
It can be understood that after the developer platform receives the program code sent by a terminal device, the program code is analyzed. If conventional manual code inspection is adopted, the scale of the platform and of the applications is greatly limited; if simply summed word vectors are adopted, the code analysis effect is poor. The embodiment of the present application provides a code vector generation method and a related device, which improve the efficiency of code inspection and achieve a better code analysis effect.
Fig. 5 shows a method for generating a code vector according to an embodiment of the present application, comprising:
501. Acquiring a first word vector sequence and a second word vector sequence corresponding to the code text, wherein the first word vector sequence is formed by arranging a first word vector to an N-th word vector in order, the second word vector sequence is formed by arranging the N-th word vector to the first word vector in order, and N is an integer greater than 1;
In the embodiment of the present application, after the server acquires the code text, the corresponding word vectors are generated from the code text, which can be done by a word vector tool (the word2vec tool). The word vectors are in the same order as the code text, i.e., the first word vector represents the first word or symbol of the code text, the second word vector represents the second word or symbol, and so on, yielding the first word vector sequence. The server then orders the first word vector sequence in reverse to obtain the second word vector sequence, which can be expressed as:
first word vector sequence = [1st word vector, 2nd word vector, …, Nth word vector];
second word vector sequence = [Nth word vector, (N-1)th word vector, …, 1st word vector];
where N is an integer greater than 1.
In the embodiment of the present application, the word2vec tool is available as open source and is not described here in detail. The word2vec tool generates vector representations of words through a probabilistic model, with the distance between vectors reflecting the correlation between word meanings. In practical applications, other natural language processing algorithms for generating word vectors, such as the fastText classification model, may also be used, which is not limited herein.
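As an illustration of step 501 only, a minimal sketch using the open-source gensim implementation of word2vec follows; the token corpus, vector size, and variable names are assumptions rather than part of the claimed method.

```python
# Sketch of step 501 with gensim's word2vec: train word vectors over
# tokenized code, then build the first (forward) and second (reversed)
# word vector sequences. All names and sizes are illustrative.
from gensim.models import Word2Vec

# Each training "sentence" is the token sequence of one code text.
token_corpus = [
    ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"],
]

w2v = Word2Vec(sentences=token_corpus, vector_size=100, min_count=1)

tokens = token_corpus[0]
first_sequence = [w2v.wv[t] for t in tokens]   # 1st ... Nth word vector
second_sequence = first_sequence[::-1]         # Nth ... 1st word vector
```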
502. Obtaining an output vector sequence through a bidirectional long short-term memory (LSTM) network model, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used for generating a first output sequence corresponding to the first word vector sequence, the reverse LSTM network model is used for generating a second output sequence corresponding to the second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is formed by arranging output vectors;
In the embodiment of the present application, the server inputs the first word vector sequence and the second word vector sequence into a bidirectional LSTM network model to obtain the output vector sequence. The model parameters of the bidirectional LSTM network model are trained in advance by the server, and the specific training process is not repeated here; alternatively, the model parameters of the bidirectional LSTM network model can be obtained through self-learning by the network and updated continuously.
Fig. 6 is a schematic diagram of a forward LSTM network model in an embodiment of the present application. After the server inputs the first word vector sequence into the forward LSTM network model, a first output sequence may be obtained:
first output sequence = [1st forward output vector, 2nd forward output vector, …, Nth forward output vector];
where the 1st forward output vector is calculated by the server from the 1st word vector through the forward LSTM network model, the 2nd forward output vector is calculated from the 2nd word vector, …, and the Nth forward output vector is calculated from the Nth word vector.
It will be appreciated that the bidirectional LSTM network model can pass memory information between neurons through learning, thereby remembering information over a long period, as described in detail below.
FIG. 7 is a schematic diagram of a bidirectional LSTM network model in an embodiment of the present application. The server inputs the sequence [X_{t-1}, X_t, X_{t+1}] into the bidirectional LSTM network model (in the embodiment of the present application, the sequence [X_{t-1}, X_t, X_{t+1}] can be the first word vector sequence or the second word vector sequence). The neurons of the bidirectional LSTM network model compute the cell state C_t and the neuron output h_t; the cell state C_t and the neuron output h_t are passed to the next neuron to participate in its calculation, and the neuron output h_t can also serve as the output of the bidirectional LSTM network model.
FIG. 8 is a schematic diagram of the internal structure of a neuron in the bidirectional LSTM network model. The sequence input into the bidirectional LSTM network model passes through four layers of operations to obtain the cell state C_t and the neuron output h_t, where the neuron output h_t can serve as the output of the model. Within a neuron of the bidirectional LSTM network model, C_{t-1} is the cell state of the previous neuron and h_{t-1} is the output of the previous neuron. Each layer of the operation is described in detail below.
1. The first layer (the forget layer) determines what information is forgotten from the cell state. The specific calculation formula is:
f_t = σ(W_f · [h_{t-1}, X_t] + b_f);
where f_t indicates the degree to which previously transmitted information is forgotten and takes a value between 0 and 1. The neuron splices h_{t-1} and X_t together and passes the result to a sigmoid function; the matrix W_f and the vector b_f are parameters self-learned by the neural network.
2. The second layer (the sigmoid layer) and the third layer (the tanh layer) are a combination of a sigmoid function and a tanh function, which determines which information needs to be added to the cell state. The specific calculation formulas are:
C'_t = tanh(W_c · [h_{t-1}, X_t] + b_c);
where C'_t is the output of the tanh layer, and W_c and b_c are parameters self-learned by the neural network. The tanh layer generates the candidate update values C'_t; the output lies in the interval [-1, 1], indicating that the cell state needs to be strengthened in some dimensions and weakened in others.
i_t = σ(W_i · [h_{t-1}, X_t] + b_i);
where i_t is the output of the sigmoid layer, and W_i and b_i are parameters self-learned by the neural network. The sigmoid layer scales the tanh layer, with i_t lying in the interval [0, 1]: i_t = 0 means the cell state in the corresponding dimension does not need to be updated and the information of the current neuron is not saved; i_t = 1 means the cell state in that dimension is updated in its entirety and the corresponding information of the previous neuron is discarded in its entirety.
Next, the new cell state C_t of the neuron is determined: C_{t-1} is multiplied by f_t to discard part of the information, and the part that needs updating, i_t * C'_t, is then added to generate the new cell state C_t:
C_t = f_t * C_{t-1} + i_t * C'_t;
3. The last, fourth layer is the output layer, which determines the output of the current neuron. The output value is related to the cell state: the previously obtained C_t is input into a tanh function to obtain candidate output values, and a sigmoid layer determines which parts of those candidates are ultimately output:
O_t = σ(W_o · [h_{t-1}, X_t] + b_o);
h_t = O_t * tanh(C_t);
where O_t is the output of the sigmoid function and h_t is the output of the neuron; W_o and b_o are parameters self-learned by the neural network.
Through the calculations of the first to fourth layers, the output h_t of a neuron can be computed from its input X_t, and the first output sequence or the second output sequence can be obtained through the calculations of all neurons in the bidirectional LSTM network model.
It will be appreciated that the activation functions used in the bidirectional LSTM network model here are the tanh and sigmoid functions; other activation functions may be used in practical applications, which is not limited herein.
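For illustration, the following numpy sketch implements a single neuron step exactly as in the four layers above; the dimensions and the random placeholder weights are assumptions, standing in for the parameters the network would learn.

```python
# Sketch of one LSTM neuron step implementing the four layers above.
# The weights are random placeholders for self-learned parameters.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    hx = np.concatenate([h_prev, x_t])     # [h_{t-1}, X_t]
    f_t = sigmoid(W_f @ hx + b_f)          # layer 1: forget layer
    i_t = sigmoid(W_i @ hx + b_i)          # layer 2: sigmoid (update) layer
    c_cand = np.tanh(W_c @ hx + b_c)       # layer 3: tanh layer, C'_t
    c_t = f_t * c_prev + i_t * c_cand      # new cell state C_t
    o_t = sigmoid(W_o @ hx + b_o)          # layer 4: output layer
    h_t = o_t * np.tanh(c_t)               # neuron output h_t
    return h_t, c_t

# Illustrative sizes: 100-dim word vectors, 128-dim hidden state.
d_in, d_h = 100, 128
rng = np.random.default_rng(0)
params = tuple(
    m for _ in range(4)
    for m in (0.1 * rng.normal(size=(d_h, d_in + d_h)), np.zeros(d_h))
)
h_t, c_t = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), params)
```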
Fig. 9 is a schematic diagram of a reverse LSTM network model in an embodiment of the present application. After the server inputs the second word vector sequence into the reverse LSTM network model, a second output sequence may be obtained:
second output sequence = [1st reverse output vector, 2nd reverse output vector, …, Nth reverse output vector];
where the 1st reverse output vector is calculated by the server from the 1st word vector through the reverse LSTM network model, the 2nd reverse output vector is calculated from the 2nd word vector, …, and the Nth reverse output vector is calculated from the Nth word vector.
In the embodiment of the present application, the neuron structure of the reverse LSTM network model is the same as that of the forward model, but the direction of state transfer and data input is opposite to the forward direction.
After the server inputs the first word vector sequence and the second word vector sequence into the bidirectional LSTM network model, the first output sequence and the second output sequence are obtained, and the server can then splice the first output sequence and the second output sequence together to obtain the output vector sequence. The output vector sequence is formed by arranging output vectors, each of which is formed by splicing a forward output vector and a reverse output vector: the 1st output vector is formed by splicing the 1st forward output vector and the 1st reverse output vector, the 2nd output vector is formed by splicing the 2nd forward output vector and the 2nd reverse output vector, …, and the Nth output vector is formed by splicing the Nth forward output vector and the Nth reverse output vector. Finally, the server obtains the output vector sequence by splicing:
output vector sequence = [1st output vector, 2nd output vector, …, Nth output vector];
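As a sketch of the splicing just described, reusing the lstm_step function from the preceding sketch, the bidirectional pass could look like this; the parameter sets and names are assumptions.

```python
# Sketch of the bidirectional pass: the forward LSTM runs over the first
# word vector sequence and the reverse LSTM over the second (reversed)
# sequence; the jth output vector splices the jth forward and jth reverse
# outputs. Reuses lstm_step from the preceding sketch; fwd_params and
# rev_params are two independently learned parameter sets (assumptions).
import numpy as np

def run_lstm(sequence, params, d_h=128):
    h, c, outputs = np.zeros(d_h), np.zeros(d_h), []
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, params)
        outputs.append(h)
    return outputs

def bidirectional_outputs(first_sequence, fwd_params, rev_params):
    fwd = run_lstm(first_sequence, fwd_params)        # 1st..Nth forward
    rev = run_lstm(first_sequence[::-1], rev_params)  # over second sequence
    rev = rev[::-1]  # re-align: position j holds the jth reverse output
    return [np.concatenate([f, r]) for f, r in zip(fwd, rev)]
```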
503. Calculating the score corresponding to each output vector according to the output vector sequence and the weight vector;
In the embodiment of the present application, after the server obtains the output vector sequence, the score corresponding to each output vector is calculated from the output vector sequence and the weight vector. It can be understood that each output vector corresponds to one score, i.e., the 1st output vector corresponds to the 1st score, the 2nd output vector corresponds to the 2nd score, …, and the Nth output vector corresponds to the Nth score, so the server calculates N scores in total.
Specifically, the server may calculate the scores corresponding to the output vectors from the output vector sequence and the weight vector using a softmax function. In the deep learning field, the softmax function maps the outputs of a plurality of neurons onto the (0, 1) interval; it can be understood that a larger value indicates that the corresponding neuron output is more important.
In practical applications, the server may also use different activation functions to calculate the score corresponding to the output vector, for example, an MLP activation function, a bilinear activation function, etc., which is not limited herein.
504. Generating a code vector corresponding to the output vector sequence according to the scores corresponding to the output vectors.
In the embodiment of the present application, steps 503 and 504 may be referred to as an attention mechanism, which analyzes the importance of each word within the whole sentence and gives higher weight to the more important words in the sentence.
FIG. 10 is a diagram illustrating the process of converting output vectors into a code vector according to an embodiment of the present application. The forward output vector and the reverse output vector are spliced into one output vector; for example, the 1st output vector is formed by splicing the 1st forward output vector and the 1st reverse output vector, the 2nd output vector is formed by splicing the 2nd forward output vector and the 2nd reverse output vector, …, and the Nth output vector is formed by splicing the Nth forward output vector and the Nth reverse output vector. After obtaining the output vectors, the server inputs them into the attention mechanism for calculation to obtain the code vector.
In the embodiment of the present application, after obtaining the scores corresponding to the output vectors, the server may multiply each output vector by its score to obtain the code vector:
code vector = [1st score × 1st output vector, 2nd score × 2nd output vector, …, Nth score × Nth output vector];
The server weights each output vector according to its score, so that important logical words in the code segment can be given higher weight, generating a higher-quality code vector.
The code vector generated by the server is shown in fig. 11, which is a schematic diagram of generating a code vector from a code segment in an embodiment of the present application. After the server obtains the code text, the code vector is finally obtained through the processing of the method described above; owing to the score weighting, the values in the code vector generally lie between -1 and 1, which facilitates subsequent processing. In fig. 11, the left part is a php code segment and the right part is the code vector obtained by conversion; once the server has converted the code segment into a code vector, the code can be processed in vector form.
Fig. 12 is a schematic diagram of converting a C language code segment into a code vector according to an embodiment of the present application. Whether the server obtains a php code segment or a C language code segment, it can convert the segment into a code vector; in practical applications, the server can also convert code in other languages such as Java, which is not described here again. In fig. 12, the left side is the C language code segment and the right side is the code vector obtained by conversion.
Optionally, on the basis of the foregoing embodiments corresponding to fig. 5, an embodiment of the present application further provides an optional embodiment of the code vector generation method, in which obtaining the first word vector sequence and the second word vector sequence corresponding to the code text comprises: acquiring the code text; converting the code text into a token sequence, wherein the token sequence is formed by converting each word or symbol in the code text; generating the N word vectors through a word vector tool according to the token sequence to obtain the first word vector sequence; and arranging the first word vector sequence in reverse order to obtain the second word vector sequence.
In the embodiment of the present application, the server first obtains the code text, where the form of the code text is shown in fig. 2, and is not described herein.
After the server obtains the code text, the code text may be converted into a token sequence. A token is a code fragment with a type that determines its semantic meaning (e.g., a keyword, a string, or a comment). The token sequence can be obtained using the conventional lexical analyzer pygments, or using a modified version of pygments, which is not limited herein.
Fig. 13 is a schematic diagram of converting a code text into a token sequence according to an embodiment of the present application. After the server converts the code text on the left side of fig. 13, the token sequence on the right side of fig. 13 is obtained. The token sequence includes a plurality of words and symbols, from which the server may generate word vectors through a word vector tool (the word2vec tool) to obtain the first word vector sequence; this first word vector sequence is the same as that in the embodiments corresponding to fig. 5 and is not described here again.
Specifically, the server may input each token (word or symbol) in the token sequence into the word vector tool in order, and the tool computes the corresponding vector representation, i.e., the word vector, of each token from the input tokens. The dimension of the word vectors may be set manually. Tokens in the token sequence may repeat; for example, the sequence on the right side of FIG. 13 contains multiple identical "variable assignment" tokens. Each token has a corresponding association with a word vector: the 1st token corresponds to the 1st word vector, the 2nd token corresponds to the 2nd word vector, …, and the Nth token corresponds to the Nth word vector.
After the server obtains the first word vector sequence, the first word vector sequence may be ordered in reverse order to obtain the second word vector sequence.
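Purely for illustration, the tokenization and reversal just described might be sketched with pygments as follows; the choice of the token-type name as the token representation and the sample code are assumptions.

```python
# Sketch of converting a code text into a token sequence with the
# conventional lexical analyzer pygments, then reversing the order
# for the second sequence. The token representation is an assumption.
from pygments import lex
from pygments.lexers import get_lexer_by_name

code_text = "<?php $x = 1; echo $x; ?>"
lexer = get_lexer_by_name("php")

# Each word or symbol in the code text becomes one token.
token_sequence = [
    str(tok_type) for tok_type, value in lex(code_text, lexer)
    if value.strip()  # drop pure-whitespace tokens
]
reversed_sequence = token_sequence[::-1]  # basis of the second sequence
```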
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5, an embodiment of the present application further provides an optional embodiment of a method for generating a code vector, where after generating, according to a score corresponding to the output vector, a code vector corresponding to the output vector sequence, the method further includes: the code vector is sent to the terminal device so that the terminal device presents the code vector.
In the embodiment of the present application, after the server obtains the code vector, the code vector can be sent to the terminal device so that the terminal device displays it for the developer to see. Specifically, the server may send the code vector to the terminal device of the software developer, which presents the code vector, for example by displaying it on the software developer's programming interface.
FIG. 14 is a diagram showing code vectors in an embodiment of the present application. After the terminal device receives the code vector sent by the server, the code vector may be displayed on the display screen, specifically within the programming software being used by the software developer or in another client. The client interface shown in fig. 14 includes a title bar, a function panel, and so on, which are not described here again.
It will be appreciated that an administrator of the developer platform may log into the server to view the code vector. For example, after the manager logs into the server, if the manager wishes to view the code vector in one of the terminal devices, the virtual button of "display code vector" may be clicked to view the code vector. After receiving the signal that the manager clicks the virtual button for displaying the code vector, the server triggers an instruction for displaying the code vector and displays the code vector of the terminal device.
Fig. 15 is a schematic diagram of a manager viewing a code vector according to an embodiment of the present application, and it can be seen that, after the manager clicks a virtual button for "displaying the code vector", the code vector is displayed on a screen. The position of the code vector displayed is below the code text, and in practical application, the code vector may be located on the right or other positions, which will not be described herein.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5, an embodiment of the present application further provides an optional embodiment of a method for generating a code vector, where after generating, according to a score corresponding to the output vector, a code vector corresponding to the output vector sequence, the method further includes: performing code analysis according to the code vector to generate a code analysis result; and sending the code analysis result to the terminal equipment, so that the terminal equipment displays the code analysis result.
In the embodiment of the present application, the server may perform code analysis according to the code vector, where the code analysis may be code retrieval, code classification, code marking, code error correction, and the like.
FIG. 16 is a diagram illustrating code analysis in an embodiment of the present application. For convenience of description, the code text and the code vector are denoted by X in fig. 16; in practical applications, the code text and the code vector are as shown in the other embodiments. The server first converts the code text into a code vector, then projects the n-dimensional code vector onto a two-dimensional plane through principal component analysis (PCA), i.e., converts the n-dimensional code vector into a two-dimensional vector, and computes the similarity, such as the cosine distance, between the two-dimensional code vectors. As shown in fig. 16, the projection distance between code text 1 and code text 2 is smaller, i.e., the cosine distance is lower, which indicates that the code vectors of code text 1 and code text 2 are similar; it is thus inferred that code text 1 and code text 2 are similar.
In the embodiment of the present application, the server may preset code text 1 as an erroneous code text, then acquire code text 2 from terminal device 2 and code text 3 from terminal device 3, and calculate the code vectors and similarities of code texts 1, 2 and 3. The server first calculates the similarity between code text 2 and code text 1; if the similarity meets a preset condition, it is inferred that code text 2 is similar to code text 1, so code text 2 is determined to be an erroneous code text, and code text 2 is sent to the terminal device so that the terminal device highlights it. If the similarity between code text 3 and code text 1 does not meet the preset condition, it can be inferred that code text 3 is dissimilar to code text 1, which indicates that code text 3 is not an erroneous code text.
In the embodiment of the present application, the server may also preset code text 2 as a class-A code text and code text 3 as a class-B code text. The server acquires code text 1 from terminal device 1 and calculates the code vectors and similarities of code texts 1, 2 and 3. It first calculates the similarity between code text 1 and code text 2; if that similarity meets the preset condition, it is inferred that code text 1 is similar to code text 2, so code text 1 is determined to be a class-A code text. If instead the similarity between code text 1 and code text 3 meets the preset condition, code text 1 is determined to be a class-B code text.
It can be understood that the preset condition corresponding to the similarity may be set according to the actual situation. For example, when the similarity measure is the cosine distance, a lower value means the two code vectors are more similar, so the preset condition may be that the cosine distance is smaller than a preset threshold. In practical applications, other preset conditions may be set, which is not limited herein.
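A minimal sketch of the comparison in fig. 16 follows, assuming scikit-learn for PCA, scipy for the cosine distance, and an arbitrary threshold; none of these specifics are prescribed by the embodiment.

```python
# Sketch of the code analysis of FIG. 16: project code vectors onto a
# two-dimensional plane with PCA, then compare them by cosine distance.
# scikit-learn, scipy, and the threshold are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
code_vectors = rng.normal(size=(3, 256))  # placeholders for code texts 1-3

points = PCA(n_components=2).fit_transform(code_vectors)

THRESHOLD = 0.3  # preset condition: cosine distance below the threshold
if cosine(points[0], points[1]) < THRESHOLD:
    print("code text 2 is similar to code text 1 (suspected error code)")
```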
The server may select a piece of code text within the entire program code for code analysis. If the analysis determines that the code text is an erroneous code text, the analysis result can be sent to the terminal device for display, for example by highlighting the code text, or the code text is highlighted when the manager logs in to the server to view it. Taking fig. 3 as an example, after receiving the analysis result from the server, the terminal device highlights the code text according to the code text identifier corresponding to the analysis result. Fig. 3 includes a title bar, a function panel, and a main interface; the boxed portion of the main interface is the highlighted portion. During software development, erroneous code text can be found in time from the highlighted portion, which improves the software developer's efficiency.
Optionally, on the basis of the foregoing embodiments corresponding to fig. 5, an embodiment of the present application further provides an optional embodiment of the code vector generation method, in which calculating the score corresponding to an output vector according to the output vector sequence and the weight vector comprises: determining a weight score corresponding to the output vector according to the output vector and the weight vector; determining a total weight score corresponding to the output vector sequence according to the output vector sequence and the weight score corresponding to each output vector; and determining the score of the output vector according to the weight score and the total weight score.
In the embodiment of the present application, the server can multiply the output vector by the weight vector to obtain the weight score corresponding to the output vector; the weight vector is self-learned by the network. The parameters of the bidirectional LSTM network model and the weight vector can be obtained through training: at the beginning of training, the server randomly assigns the weight vector and the model parameters, and during training on the input data, the weight vector and the model parameters adjust themselves according to the training objective function and the like, so that the model's predictions come closer to the real results and the code analysis effect improves.
It can be understood that the server may multiply the output vector by the weight vector to obtain the weight score corresponding to the output vector, determine the total weight score corresponding to the output vector sequence from the weight scores corresponding to all the output vectors, and determine the score of each output vector from its weight score and the total weight score. The specific calculation formula is:
a_j = exp(w_h^T · h_j) / Σ_{k=1}^{N} exp(w_h^T · h_k);
where a_j is the score of output vector j, w_h is the weight vector, w_h^T is the transpose of the weight vector, and h_j is output vector j; in the denominator, the transposed weight vector is multiplied by each output vector and the results are summed after exponentiation.
It can be seen that, by calculating the score corresponding to each output vector in this way, the server maps the scores of all output vectors h_j onto the (0, 1) interval, which is equivalent to a probability distribution over the output vectors h_j. The token corresponding to an h_j with a larger probability value, i.e., the corresponding text in the code text, has greater importance in the sentence; the token corresponding to an h_j with a smaller probability value has less importance. Through this calculation, different parts of the code can be given different weights: the code text representing the logical and structural relationships of the code is given greater weight, while variable names, punctuation marks, and the like are given less weight. Through the self-learned weight vector w_h, the network automatically learns the importance of each token in the code based on the logical and structural relationships of the input code.
In the embodiment of the present application, the server calculates the scores by means of multiplication, summation, division, and the like; in practical applications, the weight scores, the total weight score, and the scores corresponding to the output vectors may also be calculated through other algorithms, which is not specifically limited here.
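As an illustrative sketch of this score calculation, assuming numpy and output vectors h_j from the bidirectional LSTM above:

```python
# Sketch of the score calculation: the weight score is w_h^T · h_j, and
# softmax normalizes the weight scores against the total over all output
# vectors, mapping them onto (0, 1). numpy and the sizes are assumptions.
import numpy as np

def attention_scores(output_vectors, w_h):
    weight_scores = np.array([w_h @ h_j for h_j in output_vectors])
    e = np.exp(weight_scores - weight_scores.max())  # stabilized softmax
    return e / e.sum()  # a_1 ... a_N, summing to 1

rng = np.random.default_rng(0)
outputs = [rng.normal(size=256) for _ in range(5)]  # illustrative h_j
w_h = rng.normal(size=256)                          # self-learned in practice
scores = attention_scores(outputs, w_h)
```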
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 5, an embodiment of the present application further provides an optional embodiment of a method for generating a code vector, where generating, according to a score corresponding to an output vector, a code vector corresponding to an output vector sequence includes: obtaining a block vector by multiplying the fraction with the output vector; the block vectors are concatenated to form a code vector.
In the embodiment of the application, after the server calculates the score of each output vector, each output vector can be multiplied by its score and the weighted vectors spliced to form the code vector, i.e. a score-weighted representation of the sequence. The specific formula is as follows:

c = [a_1 h_1, a_2 h_2, \ldots, a_N h_N]

wherein c is the code vector, a_j is the score of output vector j, and h_j is output vector j. Each product a_j h_j is a block vector, and the code vector c is obtained by splicing the weighted output vectors:

code vector c = [a_1 × 1st output vector, a_2 × 2nd output vector, …, a_N × N-th output vector];
In the embodiment of the application, the server obtains the code vector from the scores and the output vectors by simple splicing; in practical applications, other algorithms may also be used to obtain the code vector from the scores and the output vectors, which is not specifically limited here.
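Continuing the numpy sketch above, the weighting-and-splicing step could be written as follows; the shapes are again illustrative.

```python
import numpy as np

def code_vector(H, scores):
    """Multiply each output vector by its score and splice the block vectors."""
    blocks = scores[:, None] * H        # block vector a_j * h_j, shape (N, d)
    return blocks.reshape(-1)           # c = [a_1 h_1, a_2 h_2, ..., a_N h_N]
```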
In the embodiment of the application, after the server receives the code text, the code vector is obtained through the above operations, as shown in fig. 11 and fig. 12: the left side of each figure is the code text and the right side is the code vector converted by the server, where fig. 11 shows PHP code text and fig. 12 shows C language code text.
On the basis of the foregoing embodiments corresponding to fig. 5, the embodiments of the present application further provide application examples such as the following.

FIG. 17 is a flowchart of an application example provided in an embodiment of the present application. As shown in fig. 17, the application example includes six steps:
Step one, a server acquires an original code, judges whether the original code is a token sequence, and if not, performs lexical analysis to obtain the token sequence;
it can be understood that after the server obtains a piece of data sent by a terminal device, it can determine whether the data is code text or a token sequence. If the data is a token sequence, step two may be performed directly; if not, the code text needs to be converted into a token sequence, which can generally be done through lexical analysis.
The terminal device can either send the code text directly to the server, or perform lexical analysis itself and send the resulting token sequence to the server; performing the lexical analysis on the terminal device reduces the load on the server.
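The patent does not prescribe a particular lexer. Purely as an illustration, a crude regex-based tokenizer for converting code text into a token sequence might look like the sketch below; the token categories and the regular expression are hypothetical and far from a production-grade lexer for PHP or C.

```python
import re

# Hypothetical token categories; a real lexer would be language-specific
# (e.g. for PHP or C) and far more complete.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<keyword>(?:if|else|for|while|return)\b)"
    r"|(?P<identifier>[A-Za-z_]\w*)"
    r"|(?P<number>\d+)"
    r"|(?P<symbol>[{}()\[\];,=+\-*/<>!&|]))")

def lex(code_text):
    tokens, pos = [], 0
    while pos < len(code_text):
        m = TOKEN_RE.match(code_text, pos)
        if not m:
            pos += 1        # skip anything the toy lexer cannot classify
            continue
        tokens.append(m.group(m.lastgroup))
        pos = m.end()
    return tokens

print(lex("if (a > 1) { return a + b; }"))
# ['if', '(', 'a', '>', '1', ')', '{', 'return', 'a', '+', 'b', ';', '}']
```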
Step two, the server generates word vectors according to the token sequence;
it can be appreciated that the server may generate word vectors from the token sequence, typically through a word vector tool; such tools can be found on open-source platforms, and a detailed description is omitted here.
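One such open-source tool is the word2vec implementation in gensim. The sketch below assumes gensim ≥ 4 (where the dimension parameter is named vector_size); the corpus, dimension and window size are placeholder values, not settings from the patent.

```python
from gensim.models import Word2Vec

# Each training sample is one token sequence produced from a code text.
corpus = [
    ["if", "(", "a", ">", "1", ")", "{", "return", "a", "+", "b", ";", "}"],
    ["for", "(", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++", ")", ";"],
]
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)

# First word vector sequence: token order as written; second: reversed.
first_word_vector_sequence = [model.wv[tok] for tok in corpus[0]]
second_word_vector_sequence = list(reversed(first_word_vector_sequence))
```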
Step three, the server inputs the word vector sequence into a forward LSTM network;
Step four, the server inputs the word vector sequence into a reverse LSTM network;
it should be noted that when the server inputs the word vector sequence into the forward LSTM network, the order of the word vector sequence is the same as the state transfer direction of the memory cells, and when the word vector sequence is input into the reverse LSTM network, the order of the word vector sequence is opposite to the state transfer direction of the memory cells. For details, reference may be made to the descriptions of the embodiments corresponding to fig. 6 and fig. 9, which are not repeated here.
Step five, the server splices the outputs of the forward LSTM network and the reverse LSTM network to obtain the output vectors;
After the server splices the vectors, a sequence of output vectors is obtained, wherein the sequence of output vectors comprises a plurality of output vectors.
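For illustration, steps three to five can be realized in one call in a framework such as PyTorch, whose nn.LSTM with bidirectional=True runs the forward and reverse passes and splices their outputs along the feature dimension; the dimensions below are assumptions, not the patent's settings.

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim, seq_len = 100, 128, 13
bilstm = nn.LSTM(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

word_vectors = torch.randn(1, seq_len, embed_dim)    # one word vector sequence
outputs, _ = bilstm(word_vectors)

# outputs[:, j, :hidden_dim] is the forward LSTM output at position j;
# outputs[:, j, hidden_dim:] is the reverse LSTM output at the same position.
print(outputs.shape)    # torch.Size([1, 13, 256]) -- the spliced output vectors
```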
Step six, the server converts the output vectors into a code vector through an attention mechanism.
It will be appreciated that the server first calculates the score corresponding to each output vector and then weights the output vectors according to the scores to obtain the code vector. The attention mechanism analyzes the importance of each word in the whole sentence and gives higher weight to the more important words in the sentence.
Machine learning algorithms that perform code analysis generally require their input in a numeric vector representation, so converting code segments into code vectors is the basis for making them usable by various machine-learning-based algorithms. The method of the embodiment of the application can extract useful logic information from the context of code in various languages and of any length to generate the code vector representation, thereby enabling a computer to understand the corresponding code segment through the vector.
The code vectors generated by the server may be used as inputs to machine learning tasks such as code retrieval, code classification, code tagging and code error correction, or used as a similarity measure for detecting cloned code segments.
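As one illustration of the clone-detection use, comparing two generated code vectors could reduce to a cosine similarity check; the function names and the 0.9 threshold below are arbitrary examples, not values from the patent.

```python
import numpy as np

def cosine_similarity(c1, c2):
    return float(np.dot(c1, c2) / (np.linalg.norm(c1) * np.linalg.norm(c2)))

def looks_like_clone(c1, c2, threshold=0.9):
    # The 0.9 threshold is an arbitrary illustration, not a value from the patent.
    return cosine_similarity(c1, c2) >= threshold
```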
Fig. 18 is a schematic diagram of an apparatus for generating a code vector according to an embodiment of the present application, referring to fig. 18, an apparatus 1800 for generating a code vector according to an embodiment of the present application includes:
an obtaining unit 1801, configured to obtain a first word vector sequence and a second word vector sequence corresponding to the code text, where the first word vector sequence is formed by sequentially arranging a first word vector to an N-th word vector, the second word vector sequence is formed by sequentially arranging the N-th word vector to the first word vector, and N is an integer greater than 1;
the processing unit 1802 is configured to obtain an output vector sequence through a bidirectional long-short-term memory LSTM network model, where the bidirectional LSTM network model includes a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is configured to generate a first output sequence corresponding to a first word vector sequence, the reverse LSTM network model is configured to generate a second output sequence corresponding to a second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is formed by arranging output vectors;
the processing unit 1802 is further configured to calculate a score corresponding to the output vector according to the output vector sequence and the weight vector;
A generating unit 1803, configured to generate a code vector corresponding to the output vector sequence according to the score corresponding to the output vector.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 18, an embodiment of the present application further provides an optional embodiment of an apparatus for generating a code vector, where the obtaining unit 1801 is further configured to:
acquiring a code text;
converting the code text into a tag sequence, wherein the tag sequence is formed by converting each word or symbol in the code text;
generating N word vectors through a word vector tool according to the tag sequence to obtain a first word vector sequence;
and arranging the first word vector sequence in inverted order to obtain a second word vector sequence.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 18, an embodiment of the present application further provides an optional embodiment of an apparatus for generating a code vector, where the apparatus 1800 for generating a code vector further includes:
a transmitting unit 1804 is configured to transmit the code vector to the terminal device, so that the terminal device displays the code vector.
Fig. 19 is a schematic diagram of an alternative embodiment of the apparatus for generating a code vector in an embodiment of the present application. As shown in fig. 19, the sending unit 1804 is connected to the generating unit 1803 and is configured to send the code vector to the terminal device, so that the terminal device displays the code vector.
Optionally, on the basis of the respective embodiments corresponding to fig. 18, an embodiment of the present application further provides an optional embodiment of an apparatus for generating a code vector, where the generating unit 1803 is further configured to: performing code analysis according to the code vector to generate a code analysis result;
the transmitting unit 1804 is further configured to: and sending the code analysis result to the terminal equipment so that the terminal equipment displays the code analysis result.
Optionally, on the basis of the respective embodiments corresponding to fig. 18, an embodiment of the present application further provides an alternative embodiment of the apparatus for generating a code vector, and the processing unit 1802 is further configured to:
determining a weight score corresponding to the output vector according to the output vector and the weight vector;
determining the total weight score corresponding to the output vector sequence according to the output vector sequence and the weight vector corresponding to each output vector;
and determining the score of the output vector according to the weight score and the total weight score.
Optionally, on the basis of the respective embodiments corresponding to fig. 18, an embodiment of the present application further provides an optional embodiment of an apparatus for generating a code vector, where the generating unit 1803 is further configured to:
obtaining a block vector by multiplying the score by the output vector;
The block vectors are concatenated to form a code vector.
Fig. 20 is a schematic diagram of a server structure according to an embodiment of the present application. The server 2000 may vary considerably in configuration or performance, and may include one or more central processing units (CPU) 2022 (e.g., one or more processors), a memory 2032, and one or more storage media 2030 (e.g., one or more mass storage devices) storing application programs 2042 or data 2044. The memory 2032 and the storage medium 2030 may be transitory or persistent. The program stored on the storage medium 2030 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processor 2022 may be arranged to communicate with the storage medium 2030 and execute, on the server 2000, the series of instruction operations in the storage medium 2030.
The server 2000 may also include one or more power supplies 2026, one or more wired or wireless network interfaces 2050, one or more input/output interfaces 2058, and/or one or more operating systems 2041 such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 20.
In the embodiment of the present application, the CPU 2022 is specifically configured to perform the following steps:
acquiring a first word vector sequence and a second word vector sequence corresponding to the code text, wherein the first word vector sequence is formed by arranging first word vectors to N-th word vectors in sequence, and the second word vector sequence is formed by arranging N-th word vectors to the first word vectors in sequence, and N is an integer larger than 1;
obtaining an output vector sequence through a bidirectional LSTM network model, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used for generating a first output sequence corresponding to a first word vector sequence, the reverse LSTM network model is used for generating a second output sequence corresponding to a second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is formed by arranging output vectors;
calculating the corresponding score of the output vector according to the output vector sequence and the weight vector;
generating a code vector corresponding to the output vector sequence according to the score corresponding to the output vector;
In an embodiment of the present application, the CPU 2022 is further configured to perform the following steps:
acquiring a code text;
converting the code text into a tag sequence, wherein the tag sequence is formed by converting each word or symbol in the code text;
generating N word vectors through a word vector tool according to the tag sequence to obtain a first word vector sequence;
and arranging the first word vector sequence in inverted order to obtain a second word vector sequence.
In an embodiment of the present application, the CPU 2022 is further configured to perform the following steps:
the code vector is sent to the terminal device so that the terminal device presents the code vector.
In an embodiment of the present application, the CPU 2022 is further configured to perform the following steps:
performing code analysis according to the code vector to generate a code analysis result;
and sending the code analysis result to the terminal equipment so that the terminal equipment displays the code analysis result.
In an embodiment of the present application, the CPU 2022 is further configured to perform the following steps:
determining a weight score corresponding to the output vector according to the output vector and the weight vector;
determining the total weight score corresponding to the output vector sequence according to the output vector sequence and the weight vector corresponding to each output vector;
and determining the score of the output vector according to the weight score and the total weight score.
In an embodiment of the present application, the CPU 2022 is further configured to perform the following steps:
obtaining a block vector by multiplying the score by the output vector;
the block vectors are concatenated to form a code vector.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

Claims (11)

1. A method of code vector generation, comprising:
a first word vector sequence and a second word vector sequence corresponding to a code text are obtained, wherein the first word vector sequence is formed by arranging first word vectors to N-th word vectors in sequence, the second word vector sequence is formed by arranging N-th word vectors to the first word vectors in sequence, and N is an integer larger than 1;
obtaining an output vector sequence through a bidirectional long-short-term memory LSTM network model, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used for generating a first output sequence corresponding to the first word vector sequence, the reverse LSTM network model is used for generating a second output sequence corresponding to the second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is formed by arranging output vectors;
determining a weight score corresponding to the output vector according to the output vector and the weight vector;
determining a total weight score corresponding to the output vector sequence according to the output vector sequence and the weight vector corresponding to each output vector;
determining the score of the output vector according to the weight score and the total weight score;
multiplying the score by the output vector to obtain a block vector;
and splicing the block vectors to form code vectors corresponding to the output vector sequence.
2. The method of claim 1, wherein the obtaining the first word vector sequence and the second word vector sequence corresponding to the code text comprises:
acquiring the code text;
converting the code text into a tag sequence, wherein the tag sequence is formed by converting each word or symbol in the code text;
generating the N word vectors through a word vector tool according to the tag sequence to obtain the first word vector sequence;
and arranging the first word vector sequence in inverted order to obtain the second word vector sequence.
3. The method of claim 1, wherein after concatenating the block vectors to form a code vector corresponding to the sequence of output vectors, the method further comprises:
and sending the code vector to terminal equipment so that the terminal equipment displays the code vector.
4. The method of claim 1, wherein after concatenating the block vectors to form a code vector corresponding to the sequence of output vectors, the method further comprises:
Performing code analysis according to the code vector to generate a code analysis result;
and sending the code analysis result to terminal equipment so that the terminal equipment displays the code analysis result.
5. An apparatus for generating a code vector, comprising:
an acquisition unit, configured to acquire a first word vector sequence and a second word vector sequence corresponding to a code text, wherein the first word vector sequence is formed by sequentially arranging a first word vector to an N-th word vector, the second word vector sequence is formed by sequentially arranging the N-th word vector to the first word vector, and N is an integer larger than 1;
the processing unit is used for acquiring an output vector sequence through a bidirectional LSTM network model, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used for generating a first output sequence corresponding to the first word vector sequence, the reverse LSTM network model is used for generating a second output sequence corresponding to the second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is formed by arranging output vectors;
The processing unit is further used for determining a weight score corresponding to the output vector according to the output vector and the weight vector;
determining a total weight score corresponding to the output vector sequence according to the output vector sequence and the weight vector corresponding to each output vector; and determining the score of the output vector according to the weight score and the total weight score;
a generating unit, configured to multiply the score by the output vector to obtain a block vector; and splice the block vectors to form code vectors corresponding to the output vector sequence.
6. The apparatus according to claim 5, wherein the acquisition unit is specifically configured to:
acquiring the code text;
converting the code text into a tag sequence, wherein the tag sequence is formed by converting each word or symbol in the code text;
generating the N word vectors through a word vector tool according to the tag sequence to obtain the first word vector sequence;
and arranging the first word vector sequence in inverted order to obtain the second word vector sequence.
7. The apparatus according to claim 5, further comprising a transmitting unit for transmitting the code vector to a terminal device such that the terminal device presents the code vector.
8. The apparatus of claim 5, wherein the generating unit is further configured to perform code analysis according to the code vector to generate a code analysis result;
and the sending unit is used for sending the code analysis result to the terminal equipment so that the terminal equipment displays the code analysis result.
9. A server, the server comprising: memory, transceiver, processor, and bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory, and comprises the following steps:
a first word vector sequence and a second word vector sequence corresponding to a code text are obtained, wherein the first word vector sequence is formed by arranging first word vectors to N-th word vectors in sequence, the second word vector sequence is formed by arranging N-th word vectors to the first word vectors in sequence, and N is an integer larger than 1;
obtaining an output vector sequence through a bidirectional LSTM network model, wherein the bidirectional LSTM network model comprises a forward LSTM network model and a reverse LSTM network model, the forward LSTM network model is used for generating a first output sequence corresponding to the first word vector sequence, the reverse LSTM network model is used for generating a second output sequence corresponding to the second word vector sequence, the output vector sequence is formed by splicing the first output sequence and the second output sequence, and the output vector sequence is formed by arranging output vectors;
determining a weight score corresponding to the output vector according to the output vector and the weight vector;
determining a total weight score corresponding to the output vector sequence according to the output vector sequence and the weight vector corresponding to each output vector;
determining the score of the output vector according to the weight score and the total weight score;
multiplying the score by the output vector to obtain a block vector;
splicing the block vectors to form code vectors corresponding to the output vector sequences;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
10. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 4.
11. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of claims 1 to 4.
CN201910747430.2A 2019-08-13 2019-08-13 Code vector generation method and related device Active CN110427464B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910747430.2A CN110427464B (en) 2019-08-13 2019-08-13 Code vector generation method and related device

Publications (2)

Publication Number Publication Date
CN110427464A (en) 2019-11-08
CN110427464B (en) 2023-09-26

Family

ID=68416194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910747430.2A Active CN110427464B (en) 2019-08-13 2019-08-13 Code vector generation method and related device

Country Status (1)

Country Link
CN (1) CN110427464B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111290756B (en) * 2020-02-10 2023-08-18 大连海事大学 Code-annotation conversion method based on dual reinforcement learning
CN113434136B (en) * 2021-06-30 2024-03-05 平安科技(深圳)有限公司 Code generation method, device, electronic equipment and storage medium
CN116700727B (en) * 2023-06-21 2024-02-13 广州洋葱时尚集团有限公司 Cross-platform data processing method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108932342A (en) * 2018-07-18 2018-12-04 腾讯科技(深圳)有限公司 A kind of method of semantic matches, the learning method of model and server
CN109886021A (en) * 2019-02-19 2019-06-14 北京工业大学 A kind of malicious code detecting method based on API overall situation term vector and layered circulation neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10181098B2 (en) * 2014-06-06 2019-01-15 Google Llc Generating representations of input sequences using neural networks

Also Published As

Publication number Publication date
CN110427464A (en) 2019-11-08

Similar Documents

Publication Publication Date Title
US8886515B2 (en) Systems and methods for enhancing machine translation post edit review processes
CN110427464B (en) Code vector generation method and related device
Landhäußer et al. From requirements to UML models and back: how automatic processing of text can support requirements engineering
US20190095788A1 (en) Supervised explicit semantic analysis
CN112541122A (en) Recommendation model training method and device, electronic equipment and storage medium
US9411878B2 (en) NLP duration and duration range comparison methodology using similarity weighting
CN111930792B (en) Labeling method and device for data resources, storage medium and electronic equipment
Kiefer Assessing the Quality of Unstructured Data: An Initial Overview.
CN112579727B (en) Document content extraction method and device, electronic equipment and storage medium
EP4113357A1 (en) Method and apparatus for recognizing entity, electronic device and storage medium
US11651015B2 (en) Method and apparatus for presenting information
CN110674620A (en) Target file generation method, device, medium and electronic equipment
US11928156B2 (en) Learning-based automated machine learning code annotation with graph neural network
CA3207902A1 (en) Auditing citations in a textual document
CN109240931A (en) Problem feedback information treating method and apparatus
Lattimer et al. Fast and accurate factual inconsistency detection over long documents
Imran et al. Automatically selecting follow-up questions for deficient bug reports
CN112559711A (en) Synonymous text prompting method and device and electronic equipment
CN111368195B (en) Model training method, device, equipment and storage medium
CN110929499B (en) Text similarity obtaining method, device, medium and electronic equipment
CN110362688B (en) Test question labeling method, device and equipment and computer readable storage medium
CN110427330B (en) Code analysis method and related device
CN115510860A (en) Text sentiment analysis method and device, electronic equipment and storage medium
CN111199421A (en) User recommendation method and device based on social relationship and electronic equipment
CN115687651A (en) Knowledge graph construction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant