GB2575496A - Runtime analysis - Google Patents

Runtime analysis

Info

Publication number
GB2575496A
Authority
GB
United Kingdom
Prior art keywords
machine learning
input
learning model
computer code
code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1811477.7A
Other versions
GB201811477D0 (en)
Inventor
Atkinson Liam
Marnette Bruno
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Prodo Tech Ltd
Original Assignee
Prodo Tech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Prodo Tech Ltd filed Critical Prodo Tech Ltd
Priority to GB1811477.7A priority Critical patent/GB2575496A/en
Publication of GB201811477D0 publication Critical patent/GB201811477D0/en
Priority to PCT/GB2019/051964 priority patent/WO2020012196A1/en
Publication of GB2575496A publication Critical patent/GB2575496A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/362Software debugging
    • G06F11/3636Software debugging by tracing the execution of the program
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3604Software analysis for verifying properties of programs
    • G06F11/3608Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3604Software analysis for verifying properties of programs
    • G06F11/3612Software analysis for verifying properties of programs by runtime analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks


Abstract

A method of training a machine learning model, and a method and system of use thereof, to statically analyse and predict the execution of program code. The training method comprises receiving input code, applying one or more tracers to the code, collecting runtime information in relation to the tracers, and generating a trained machine learning model based on the runtime information. The method and system of using the machine learning model comprise the steps of receiving computer code, detecting one or more elements of runtime data in relation to the code, predicting the execution of the code using the trained machine learning model and outputting the prediction. The runtime information related to the tracers may be internal contradictions, inconsistencies, errors, variable mismatches or variable discrepancies. The machine learning model may be a neural network or a classifier and can annotate code with metadata in relation to the nature or runtime value of a component. The input code may be a labelled dataset. The program code may be represented in a graphical format during analysis comprising one or more labelled nodes.

Description

RUNTIME ANALYSIS
Field
The present invention relates to a method, apparatus, and system to analyse and predict the execution of computer code. More particularly, the present invention relates to the teaching of machines to review static code and predict and prevent possible issues at the code review stage, before the code is deployed and executed in production.
Background
The process of writing and developing computer code can be arduous and require extensive testing. When an error is output, it can be very time consuming to find the source of the error and apply a repair. Errors can arise from many different sources, potentially simultaneously, or can be caused by the interference between a plurality of different sections of computer code. Unforeseen consequences of one or more aspects of the computer code can create a cascading series of errors, which may severely limit the usefulness of any product using said code.
Summary of Invention
Aspects and/or embodiments seek to provide a method, apparatus, and system to analyse and predict the execution of computer code.
According to a first aspect, there is provided a method of training a machine learning model for predicting the execution of computer code, the method comprising the steps of: receiving input computer code; applying one or more tracers to the computer code; collecting runtime information in relation to the one or more tracers; and generating a machine learning model based on the runtime information.
It is an aim that the arrangement disclosed herein is operable to take a source file as input and infer one or more properties of the source code in that file, without actually running the source code. Therefore, one or more issues in relation to the source code may be discovered in advance. For example, contradictions in the code may be revealed and/or highlighted. In one embodiment this may be performed by finding a type mismatch, or a variable that is incorrectly used while having the correct type. For example, a function may expect an integer argument but instead be passed a string. In a further example, a function may be passed an integer that it expects to describe the height of an object, whereas it is actually the age of an object. This type of mistake may be inferred from the names of variables, even if the variables have the same type. In principle this arrangement of model could also be used to identify issues relating to the optimisation of code.
Optionally, the one or more tracers relate to data regarding one or more variables. Optionally, the one or more variables comprise one or more of: a number; a string of characters; an array; a function and/or an argument. Optionally, the one or more functions comprise one or more function return types. Optionally, the one or more arguments comprise one or more argument types. Optionally, runtime information in relation to the one or more tracers comprises data regarding one or more of: internal contradictions and/or inconsistencies; errors; variable mismatches; and/or variable discrepancies.
Conventionally used computer programming languages, and hence any code generated, may be unforgiving in respect of variable errors. For example, if a particular function expects to receive a string and instead is provided with a numerical value, the entire function may cease to perform as expected. This can have a severe effect on any subsequent or parallel functions, and may cause the entire arrangement using that function to not perform as desired. Such errors may be easily introduced by accident, and can be difficult to detect once programmed. Therefore, it is advantageous if such errors can be detected.
Optionally, the machine learning model comprises a neural network. Optionally, the machine learning model is a classifier. Optionally, the machine learning model is operable to annotate code with metadata in relation to one or more of: the nature and/or runtime value of a component.
Machine learning models, in particular neural networks, can provide a powerful and adaptable tool to perform the method as disclosed herein. Data may be processed significantly faster than using conventional tools, and the results optimised according to a specific requirement. The metadata may comprise information about the type of a variable, but may instead or additionally comprise information in relation to the numerical value of an integer.
A practical use case for such an arrangement is when neural network code is implemented, for example using PyTorch or TensorFlow. A common error is to connect multiple layers of neurons whose sizes do not match. If a machine-learning-powered analyser such as that disclosed herein were operable to infer that the output size of one layer is 72 but the input size of the next layer is 73, then a flag may be raised to provide an alert to a potential inconsistency. The machine learning model may also be able to infer one or more elements of information about the effects of different elements in the code. For example, it could predict whether a function is “pure” or has side effects. Such side effects may include the function mutating some of its arguments, rendering the function more dangerous to use.
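The layer-size consistency check described above can be sketched as follows. This is an illustrative Python sketch only: the function name and the tuple format for layer descriptions are hypothetical stand-ins for properties the trained model would infer statically from the code.

```python
# Sketch of a layer-size consistency check. Each layer is described by a
# hypothetical (name, input_size, output_size) tuple, standing in for
# sizes the trained model would infer without running the code.

def check_layer_sizes(layers):
    """Return warnings for adjacent layers whose sizes disagree."""
    warnings = []
    for (name_a, _, out_a), (name_b, in_b, _) in zip(layers, layers[1:]):
        if out_a != in_b:
            warnings.append(
                f"{name_a} outputs {out_a} features but {name_b} expects {in_b}"
            )
    return warnings

# The 72-vs-73 inconsistency from the example above:
net = [("fc1", 128, 72), ("fc2", 73, 10)]
```

Running the check on `net` would flag the 72/73 mismatch between the two layers.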
Optionally, the input comprises a labelled dataset.
Labelled datasets can provide a valuable training tool for any machine learning arrangement. In time, the machine learning arrangement can learn to label a new, unlabelled dataset based on the labels from the previously provided labelled datasets.
Optionally, the arrangement disclosed herein further comprises the step of: outputting a compiled set of data in relation to the predicted execution of the input computer code.
The results of any analysis performed by the machine learning arrangement may be of interest to a programmer or similar developer. Errors may be scrutinised and corrected, and hence programs overall may be improved. Therefore, it is advantageous to provide such a summary of the findings.
Optionally, there is provided a computer implemented method of predicting the execution of computer code comprising the application of a machine learning model trained according to the method disclosed herein.
According to a further aspect, there is provided a computer implemented method of predicting the execution of computer code, the method comprising the steps of: receiving input computer code; detecting one or more elements of runtime data in relation to the input computer code; predicting the execution of the input computer code using the one or more elements of runtime data in combination with a trained machine learning model; outputting one or more predictions based on an output of the trained machine learning model. Optionally, the output comprises an analysis of predicted execution of the input computer code.
In order for a trained machine learning arrangement to be of use, it may be necessary to provide a method within which any relevant analysis is performed. Therefore, the advantages of such an arrangement may be fully utilised.
Optionally, the arrangement disclosed herein further comprises the steps of: representing the input computer code in a graphical format comprising one or more nodes; and labelling the one or more nodes in the graphical representation.
Graph based models are models which are based on graph theory. A graphical model, also referred to as a probabilistic graphical model or a structured probabilistic model, shows the conditional dependence structure between random variables. This probabilistic model is expressed through a graph and hence may be used to more easily define the structure of a machine learning problem. Conventionally two separate branches of graphical representations of distributions are commonly used. These branches comprise Bayesian networks and Markov random fields.
Optionally, the labelling of the one or more nodes comprises labelling in relation to one or more items of runtime data. Optionally, the trained machine learning model comprises a machine model trained according to the method disclosed herein.
Runtime data may be extremely relevant to the processing of computer code, and hence useful to include when labelling one or more nodes.
According to a further aspect, there is provided a computer code prediction tool, comprising: an input module operable to receive an input comprising computer code; a processor comprising a trained machine learning model operable to: detect one or more elements of runtime data in relation to the input; and predict the execution of the input using the one or more elements of runtime data in combination with the trained machine learning model; and an output module operable to generate an output comprising one or more predictions in relation to the input.
According to a further aspect, there is provided a system for predicting the execution of computer code, comprising: an input module operable to receive an input comprising computer code; a processor comprising a trained machine learning model operable to: detect one or more elements of runtime data in relation to the input; and predict the execution of the input using the one or more elements of runtime data in combination with the trained machine learning model; and an output module operable to generate an output comprising one or more predictions in relation to the input.
Brief Description of Drawings
Embodiments will now be described, by way of example only and with reference to the accompanying drawings having like-reference numerals, in which:
Figure 1 shows a high-level overview of the production of a graph; and
Figure 2 shows a classification flow chart comprising a mathematical model.
Specific Description
Referring to Figures 1 and 2, a first embodiment will now be described.
It is understood that a machine learning (ML) arrangement can only infer useful information if it is provided with the appropriate data and equipped with a sufficiently powerful computer model. The computer model can comprise a machine learning model, for example a neural network. Therefore, it is advantageous to collect a large amount of data regarding computer code, with a sufficient coverage and/or density of labels providing meaningful information in relation to the runtime activity of the code. It may further be advantageous for the machine learning arrangement to be operable to process this information efficiently, thereby reducing the time taken and computational expense required.
Computer code can be accessed from free repositories on the internet, such as GitHub. Alternatively, or in addition, computer code may be furnished directly by a user, thereby providing the user with the associated developer tools disclosed herein. Code bases comprising tests and test coverage of relatively high quality can provide a stronger source of training material for the machine learning model. Tracers are then added into the computer code, also referred to as source code, provided to the machine learning model. A compiler can alternatively or in addition be modified to add tracers when the source code is compiled. One or more tests are then run, and information in relation to the runtime is collected. Such information may comprise tracing type information, for example checking whether a variable is a number, a string of characters, or an array. It is understood that this is a non-limiting example, and such information can comprise a range of other data relating to the testing of computer code. The code, annotated with the types observed at runtime, is then provided to the machine learning model.
Open source repositories on GitHub can store code comprising tests. The running of these tests can be automated using the arrangement disclosed herein. Tracing functions are added to the code in the repository. These allow runtime information about the code to be output when it is run. The tests themselves may then be run, and runtime information for the parts of the repository that are used by the tests may be extracted. The arrangement thereby effectively automates the labelling of runtime type information for the nodes in a graph representation of a program. The source code may be formed into the graphical representation using JavaScript (JS), as shown in particular in Figure 1.
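The tracing step described above might be sketched in Python as a decorator that records the types of arguments and return values each time a function runs. The names `trace` and `TRACE_LOG` are illustrative, not taken from the patent; the records collected this way would become the runtime labels for training.

```python
import functools

# Minimal tracing sketch: wrap a function so that the types of its
# arguments and return value are recorded every time it is called.
TRACE_LOG = []

def trace(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        TRACE_LOG.append({
            "function": fn.__name__,
            "arg_types": [type(a).__name__ for a in args],
            "return_type": type(result).__name__,
        })
        return result
    return wrapper

@trace
def add(a, b):
    return a + b
```

After a test suite exercises `add`, each entry in `TRACE_LOG` pairs a call site with its observed runtime types, which is the kind of annotation fed to the model.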
Source code may further be carefully modelled when provided to one or more neural networks. A model may be used that resembles conventional graph-based models, but comprising notable differences in relation to:
• The information encoded into different edge types;
• How the model is taught to pay attention to different edges;
    • How multiple edge types are processed in parallel; and
    • How strings, for example variable names, are encoded.
The notable differences are as follows:
• The information encoded into different edge types;
The model separately considers edges relating to where a variable was defined, where a variable was next used, and the union of all edge types in the AST.
• How the model is taught to pay attention to different edges;
Each type of edge (listed above) has a learnable self-attention mechanism associated with it. Each node's vector representation is fed through a linear layer, and its neighbours are fed through a different linear layer. Each pair is then concatenated and multiplied by a learnable vector. The scalar values resulting for each of a node's neighbours are fed through a SoftMax function and used as weights in a weighted sum of the neighbouring vectors. This process is performed separately for the adjacency matrix of each edge type, and the different vectors are again concatenated and fed through a linear layer.
• How multiple edge types are processed in parallel
Due to the nature of GPU programming, each 'attention head' can be processed in parallel.
• How strings, for example variable names, are encoded.
Strings are encoded as follows. Firstly, each ASCII character and a padding character are given a learnable embedding vector. Thus, each character in a string can be one-hot encoded, and then fed through this embedding layer to give a sequence of dense vectors. Strings are truncated to 30 characters, and those shorter than 30 characters are padded with a PAD token. These are then processed with a stack of 1D convolutional layers with max pooling, to produce a single vector representation of the string.
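The character-index stage of this encoding can be sketched as follows. The choice of index 128 for the PAD token and 129 for an unknown (non-ASCII) character is an assumption for illustration; the embedding and convolutional layers that follow are omitted.

```python
# Sketch of the string-encoding front end: map each character to an
# index into a (learnable) embedding table, truncate to 30 characters,
# and right-pad shorter strings with a PAD token.

MAX_LEN = 30
PAD = 128   # assumed index for the padding token (after 128 ASCII codes)
UNK = 129   # assumed index for non-ASCII / unknown characters

def encode_string(s):
    """Return a fixed-length list of MAX_LEN character indices."""
    indices = [ord(c) if ord(c) < 128 else UNK for c in s[:MAX_LEN]]
    return indices + [PAD] * (MAX_LEN - len(indices))
```

Each resulting index would select a dense embedding vector before the convolutional stack summarises the sequence into a single vector.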
The abstract syntax tree (AST) of a program may be augmented with the following:
1) Scope information, which can be encoded as edges between nodes in the graph; and/or
2) Edges describing the specific relationship between nodes. For example, a type of edge may be added for the relationship between a node and its right child, and another for its left child. String information may also be retained in the nodes that have it, and further distinguish between different kinds of string. For example, a variable name and the string value of a variable are different.
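The augmentation above can be sketched with Python's standard `ast` module (rather than the JavaScript tooling mentioned earlier). Two illustrative edge types are built: plain AST parent-child edges and, as a simple stand-in for "where a variable was next used" edges, links between successive occurrences of the same variable name. The edge-type names are assumptions for illustration.

```python
import ast

# Sketch of building a multi-edge-type graph over a program's AST:
# one edge type for AST parent-child structure, and one linking
# successive occurrences of the same variable name.

def build_graph(source):
    tree = ast.parse(source)
    nodes = []
    edges = {"ast_child": [], "next_use": []}
    last_use = {}
    for parent in ast.walk(tree):
        nodes.append(parent)
        for child in ast.iter_child_nodes(parent):
            edges["ast_child"].append((id(parent), id(child)))
        if isinstance(parent, ast.Name):
            if parent.id in last_use:
                edges["next_use"].append((last_use[parent.id], id(parent)))
            last_use[parent.id] = id(parent)
    return nodes, edges
```

For `x = 1; y = x + x`, the variable `x` occurs three times, so two `next_use` edges are produced alongside the structural edges.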
For the creation of the machine learning model used in at least one embodiment of the arrangement disclosed herein, a graph comprising multiple edge types is defined as

G = (A, X)

where A ∈ ℝ^(E×N×N) comprises a stack of adjacency matrices, each slice e corresponding to the adjacency matrix of a specific edge type, and X ∈ ℝ^(N×F) are the node features. Each node i ∈ N, at each time-step t, has a vector representation h_i^(t).

At time t = 0, h_i^(0) is produced from X_i, the feature vector of the node. The t-th layer in the model transforms each h_i^(t) into h_i^(t+1). The set of node vectors at a given time-step is h^(t) = [h_1^(t), ..., h_N^(t)]. Thus, a layer of the model computes

f_t(A, h^(t)) = h^(t+1)

While the dimensionality of node vectors at step (t + 1) does not have to equal that at step (t), in this particular example it does. This model builds on two recent approaches in neural networks applied to graphs.
A modified version of the Graph Attention Network is used to pass messages between neighbours in each edge type's adjacency matrix. The representations resulting from each edge type are then concatenated and passed through a feed-forward layer. A recurrent model is then used to control how each node-vector resulting from message passing is propagated to the next layer. The type of each node is then predicted by feeding the final vector for each node through a feed-forward layer and applying a SoftMax function.
α_ij = exp( LeakyReLU( a^T [W_s h_i || W_n h_j] ) ) / Σ_{k ∈ N(i)} exp( LeakyReLU( a^T [W_s h_i || W_n h_k] ) )

Equation (1)

Equation (1) shows how messages may be passed between neighbouring nodes. The || operator is concatenation, and a ∈ ℝ^(2F). Each node's vector is passed through a linear layer which differs depending on whether the node is the original (W_s) or a neighbour (W_n). Each pair is concatenated and multiplied by the vector a, before being passed through a leaky rectified linear unit, or LeakyReLU. These scalar values are then normalised by a SoftMax function. The values are only computed for non-zero entries in the adjacency matrix. Computing α_ij for all valid i, j results in the attention matrix A^(e). The matrix multiplication performs a learnable weighted average based on connections in the adjacency matrix.
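The per-node attention step can be followed in a pure-Python sketch for a single node and one edge type. The weight matrices and the attention vector below are arbitrary toy values standing in for learned parameters.

```python
import math

# Sketch of attention over one node's neighbours: score each neighbour
# with LeakyReLU(a . [W_s h_i || W_n h_j]), normalise the scores with
# SoftMax, and return the weighted average of neighbour vectors.

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def attend(h_i, neighbours, W_self, W_neigh, a):
    s = matvec(W_self, h_i)
    scores = []
    for h_j in neighbours:
        concat = s + matvec(W_neigh, h_j)          # the || operator
        scores.append(leaky_relu(sum(ai * ci for ai, ci in zip(a, concat))))
    exps = [math.exp(sc) for sc in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]             # SoftMax weights
    dim = len(h_i)
    out = [sum(al * h_j[d] for al, h_j in zip(alphas, neighbours))
           for d in range(dim)]
    return out, alphas
```

With a zero attention vector every neighbour scores equally, so the output reduces to a plain average of the neighbour vectors, which makes the mechanics easy to verify by hand.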
As in some embodiments multiple adjacency matrices corresponding to different edge types are considered, attention may be applied to each of these adjacency matrices independently, and the result concatenated. This is described in Equation (2) below.
x_i^(t) = σ( ||_{e = 0}^{E} [ A^(e) h^(t) W_e ]_i )
Equation (2)
These vectors are then fed into a gating mechanism. It was found that even with skip-connections and layer normalisation, models without gates, such as Graph Attention Networks and Graph Convolutional Networks, were difficult to train with multiple layers. Hence the Gated Recurrent Unit (GRU) architecture was modified, though experiments with long short-term memory (LSTM) gating yielded similar results.
r_i = σ( W_ir x_i + W_cr c_i^(t) + b_r )
z_i = σ( W_iz x_i + W_cz c_i^(t) + b_z )
n_i^(t) = tanh( W_in x_i + r_i ⊙ ( W_hn c_i^(t) ) + b_n )
h_i^(t+1) = (1 − z_i) ⊙ n_i^(t) + z_i ⊙ c_i^(t)
Equation (3)
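The gating in Equation (3) can be illustrated with a scalar sketch, where each weight matrix collapses to a single number. The default weights are arbitrary placeholders for learned parameters, not values from the patent.

```python
import math

# Scalar sketch of the GRU-style gate: the message-passing output x is
# merged with the node's previous state c via reset (r) and update (z)
# gates, producing the next-layer state.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, c,
             w_ir=1.0, w_cr=1.0, b_r=0.0,
             w_iz=1.0, w_cz=1.0, b_z=0.0,
             w_in=1.0, w_hn=1.0, b_n=0.0):
    r = sigmoid(w_ir * x + w_cr * c + b_r)            # reset gate
    z = sigmoid(w_iz * x + w_cz * c + b_z)            # update gate
    n = math.tanh(w_in * x + r * (w_hn * c) + b_n)    # candidate state
    return (1.0 - z) * n + z * c                      # new state h^(t+1)
```

The update gate z interpolates between keeping the old state c and adopting the candidate n, which is what lets gated models train with many stacked layers.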
Finally, classification is performed per-node by a feed-forward layer followed by a SoftMax function:
p_i = SoftMax( W_out h_i^(t+1) )
Equation (4)
In one embodiment, the machine learning model must first compute the initial vector representation for each node, h_i^(0). This may be performed by combining the properties of one or more nodes, for example: node type, value type, and/or value information. Each property is encoded as a 0 or 1, and then fed through a linear layer. The node's type is one-hot encoded, and fed through a linear layer.
Every value that a node can take can be described as a series of characters. An embedding layer is created for each ASCII character, as well as for an unknown token and a padding token. Reviewing the distribution of the number of characters for each node in the training set, a sequence of length 30 was created for each node, and sequences of fewer than 30 characters are right-padded. Each sequence is embedded, and subsequently summarised to produce a single vector. Each of these components is then concatenated and fed through a linear layer to produce the initial node vector used by the model.
h_i^(0) = W_init ( v_prop,i || v_node,i || v_value,i )
Equation (5)
The arrangement may be provided separately or integrated into any existing code-writing arrangement. It can run at any point where the AST of a program can be computed, so anywhere that a linter can run. A linter, also referred to as lint, refers to one or more tools operable to analyse computer code to flag programming errors. Such errors may comprise one or more of stylistic errors, bugs, and/or suspicious constructs. This could be in an integrated development environment (IDE), at the commit level, during a pull request, or anywhere in between.
Compared to conventional datasets used in the context of machine learning for code, the arrangement disclosed herein comprises a greater “density”, as in at least one embodiment almost every node in the code graph is annotated by a type annotation. The framework used to produce the data may also be generic and hence could be extended to track a range of information optionally comprising the inspection of the value of different variables, instead of their type.
One embodiment of the arrangement is disclosed below, specifically showing an example where inferring and/or guessing types is conducive to productivity. The following code extract contains a bug:
function foo() {
    job().then(function (data) {
        doSomething(data);
    });
}
The bug here is that the developer is using a “promise” (the result of “job()”) to chain different elements of work, but forgets to return this promise. This can be problematic because it makes it impossible to chain other elements of work after this one later.
A more correct version of the code would be:
function foo() {
    return job().then(function (data) {
        doSomething(data);
    });
}
Conventional tools cannot provide any feedback because they do not know that “job()” is a promise. The typing model as disclosed herein would by contrast be able to infer this.
It is understood that the use of the term “execution” is a general term referring to the behaviour of a computer program when executed.
It is understood that the arrangement disclosed herein could further be used not just to identify and repair issues in computer code, but also to generate code itself. The arrangement can comprise teaching a computer to understand what code actually does during an execution phase, and that may be an important building block when teaching a machine to generate code. For example, if Reinforcement Learning were being used, one of the reinforcement loops could comprise the step of penalising code that does not perform its intended function, such as code that does not return the correct type.
Any system feature as described herein may also be provided as a method feature, and vice versa. As used herein, means-plus-function features may be expressed alternatively in terms of their corresponding structure.
Any feature in one aspect may be applied to other aspects, in any appropriate combination. In particular, method aspects may be applied to system aspects, and vice versa. Furthermore, any, some and/or all features in one aspect can be applied to any, some and/or all features in any other aspect, in any appropriate combination.
It should also be appreciated that particular combinations of the various features described and defined in any aspects can be implemented and/or supplied and/or used independently.

Claims (19)

CLAIMS:
1. A method of training a machine learning model for predicting the execution of computer code, the method comprising the steps of:
receiving input computer code;
applying one or more tracers to the computer code;
collecting runtime information in relation to the one or more tracers; and generating a machine learning model based on the runtime information.
2. The method of claim 1, wherein the one or more tracers relates to data regarding one or more variables.
3. The method of claim 2, wherein the one or more variables comprise one or more of: a number; a string of characters; an array; a function and/or an argument.
4. The method of claim 3, wherein the one or more functions comprise one or more function return types.
5. The method of claim 3, wherein the one or more arguments comprise one or more argument types.
6. The method of any preceding claim, wherein runtime information in relation to the one or more tracers comprises data regarding one or more of: internal contradictions and/or inconsistencies; errors; variable mismatches; and/or variable discrepancies.
7. The method of any preceding claim, wherein the machine learning model comprises a neural network.
8. The method of any preceding claim, wherein the machine learning model is a classifier.
9. The method of any preceding claim, wherein the machine learning model is operable to annotate code with metadata in relation to one or more of: the nature and/or runtime value of a component.
10. The method of any preceding claim, wherein the input comprises a labelled dataset.
11. The method of any preceding claim, further comprising the step of:
outputting a compiled set of data in relation to the predicted execution of the input computer code.
12. A computer implemented method of predicting the execution of computer code comprising the application of a machine learning model trained according to the method of any preceding claim.
13. A computer implemented method of predicting the execution of computer code, the method comprising the steps of:
receiving input computer code;
detecting one or more elements of runtime data in relation to the input computer code;
predicting the execution of the input computer code using the one or more elements of runtime data in combination with a trained machine learning model;
outputting one or more predictions based on an output of the trained machine learning model.
14. The method of claim 13, wherein the output comprises an analysis of predicted execution of the input computer code.
15. The method of any one of claims 13 to 14, further comprising the steps of:
representing the input computer code in a graphical format comprising one or more nodes; and labelling the one or more nodes in the graphical representation.
16. The method of claim 15, wherein the labelling of the one or more nodes comprises labelling in relation to one or more items of runtime data.
17. The method of any one of claims 13 to 16, wherein the trained machine learning model comprises a machine model trained according to the method of any one of claims 1 to 11.
18. A computer code prediction tool, comprising:
an input module operable to receive an input comprising computer code;
a processor comprising a trained machine learning model operable to:
detect one or more elements of runtime data in relation to the input; and
predict the execution of the input using the one or more elements of runtime data in combination with the trained machine learning model; and
an output module operable to generate an output comprising one or more predictions in relation to the input.
19. A system for predicting the execution of computer code, comprising:
an input module operable to receive an input comprising computer code;
a processor comprising a trained machine learning model operable to:
detect one or more elements of runtime data in relation to the input; and
predict the execution of the input using the one or more elements of runtime data in combination with the trained machine learning model; and
an output module operable to generate an output comprising one or more predictions in relation to the input.
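Outside the claim language, the general shape of claims 13 to 16 (collecting runtime data via tracers, then labelling the nodes of a representation of the code) can be sketched in plain Python. This is an illustrative, stdlib-only sketch and not the patented implementation: the names `trace_types` and `label_nodes` are invented for this example, `sys.settrace` stands in for the claimed tracers, the AST stands in for the graphical node representation, and a real system would feed such labels to a trained machine learning model rather than merely returning them.

```python
import ast
import sys

def trace_types(source):
    """Execute `source` under a tracer and record the runtime type
    observed for each top-level variable (a crude stand-in for the
    claimed tracers collecting runtime information)."""
    observed = {}

    def tracer(frame, event, arg):
        # On each line event (and on frame exit) snapshot variable types.
        if event in ("line", "return"):
            for name, value in frame.f_locals.items():
                if not name.startswith("__"):
                    observed[name] = type(value).__name__
        return tracer

    env = {}
    sys.settrace(tracer)
    try:
        exec(compile(source, "<snippet>", "exec"), env, env)
    finally:
        sys.settrace(None)
    return observed

def label_nodes(source, observed):
    """Represent the code as a tree of nodes and label each Name node
    with the runtime type seen during tracing -- an analogue of
    labelling nodes in a graphical representation of the code."""
    labels = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id in observed:
            labels.append((node.id, node.lineno, observed[node.id]))
    return labels

snippet = "x = 1\ny = x + 2.5\nz = str(y)\n"
types = trace_types(snippet)
labels = label_nodes(snippet, types)
```

In this toy run the tracer observes `x` as an `int`, `y` as a `float` and `z` as a `str`, and each occurrence of those names in the AST is labelled with the observed type; in the claimed method, labelled data of this kind forms the training input for the machine learning model.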
GB1811477.7A 2018-07-12 2018-07-12 Runtime analysis Withdrawn GB2575496A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1811477.7A GB2575496A (en) 2018-07-12 2018-07-12 Runtime analysis
PCT/GB2019/051964 WO2020012196A1 (en) 2018-07-12 2019-07-12 Runtime analysis of source code using a machine learning model trained using trace data from instrumented source code

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1811477.7A GB2575496A (en) 2018-07-12 2018-07-12 Runtime analysis

Publications (2)

Publication Number Publication Date
GB201811477D0 GB201811477D0 (en) 2018-08-29
GB2575496A true GB2575496A (en) 2020-01-15

Family

ID=63273322

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1811477.7A Withdrawn GB2575496A (en) 2018-07-12 2018-07-12 Runtime analysis

Country Status (2)

Country Link
GB (1) GB2575496A (en)
WO (1) WO2020012196A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11334351B1 (en) 2020-04-28 2022-05-17 Allstate Insurance Company Systems and methods for software quality prediction
US20220147813A1 (en) * 2020-11-06 2022-05-12 Micron Technology, Inc. Runtime optimization of computations of an artificial neural network compiled for execution on a deep learning accelerator
CN114611714B (en) * 2022-05-11 2022-09-02 成都数之联科技股份有限公司 Model processing method, device, system, electronic equipment and storage medium

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20180150742A1 (en) * 2016-11-28 2018-05-31 Microsoft Technology Licensing, Llc. Source code bug prediction

Non-Patent Citations (1)

Title
None *

Also Published As

Publication number Publication date
WO2020012196A1 (en) 2020-01-16
GB201811477D0 (en) 2018-08-29

Similar Documents

Publication Publication Date Title
Watson et al. On learning meaningful assert statements for unit test cases
Allamanis et al. Self-supervised bug detection and repair
Majd et al. SLDeep: Statement-level software defect prediction using deep-learning model on static code features
Gupta et al. Deepfix: Fixing common c language errors by deep learning
CN109426722B (en) SQL injection defect detection method, system, equipment and storage medium
Xia et al. Collective personalized change classification with multiobjective search
CN106537332A (en) Systems and methods for software analytics
WO2020012196A1 (en) Runtime analysis of source code using a machine learning model trained using trace data from instrumented source code
Hidders et al. DFL: A dataflow language based on Petri nets and nested relational calculus
Veanes et al. Data-parallel string-manipulating programs
Zhong et al. An empirical study on API parameter rules
Yu et al. Learning the relation between code features and code transforms with structured prediction
Zhang et al. A survey on large language models for software engineering
Alkhazi et al. Multi-criteria test cases selection for model transformations
Selim et al. Model transformations for migrating legacy deployment models in the automotive industry
Wang et al. Solver-based sketching of alloy models using test valuations
Gammaitoni et al. Agile validation of model transformations using compound F-Alloy specifications
Yuan et al. Java code clone detection by exploiting semantic and syntax information from intermediate code-based graph
Wang et al. Synergy between machine/deep learning and software engineering: How far are we?
US20230385037A1 (en) Method and system for automated discovery of artificial intelligence (ai)/ machine learning (ml) assets in an enterprise
Patil Automated Vulnerability Detection in Java Source Code using J-CPG and Graph Neural Network
Romanov et al. Representing programs with dependency and function call graphs for learning hierarchical embeddings
Pravin et al. An efficient programming rule extraction and detection of violations in software source code using neural networks
Sandrasegaran et al. Enhancing software quality using artificial neural networks to support software refactoring
Sosík et al. A limitation of cell division in tissue P systems by PSPACE

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)