WO2019180314A1

WO2019180314A1 - Artificial neural networks

Info

Publication number: WO2019180314A1
Application number: PCT/FI2019/050220
Authority: WO
Inventors: Caglar AYTEKIN; Francesco Cricri
Original assignee: Nokia Technologies Oy
Priority date: 2018-03-20
Filing date: 2019-03-14
Publication date: 2019-09-26
Also published as: GB2572164A; GB201804451D0

Abstract

An apparatus is disclosed, comprising means for providing a plurality of neural network instances (81, 82) from a baseline neural network (80). The apparatus may further comprise means for causing connection dropout in each neural network instance (81, 82), determined over a plurality of dropout iterations, to provide modified versions of the neural network instances (84, 85). The apparatus may also comprise means for causing training of each modified neural network instance (84, 85) using a respective training data set over a plurality of training epochs.

Description

Artificial Neural Networks

Technical Field

Example embodiments relate to artificial neural networks.

Background

An artificial neural network (“neural network”) is a computer system inspired by the biological neural networks in human brains. A neural network may be considered a particular kind of algorithm or architecture used in machine learning. A neural network may comprise a plurality of discrete elements called“artificial neurons” which may be connected to one another in various ways, in order that the strengths or weights of the connections may be adjusted with the aim of optimising the neural networks performance on a task in question. The artificial neurons may be organised into layers, typically an input layer, one or more hidden layers, and an output layer. The output from one layer becomes the input to the next layer and so on until the output is produced by the final layer.

Known applications of neural networks include pattern recognition, image processing and classification, merely given by way of example.

It is known to ensemble neural networks, that is by combining the output, parameters or decisions made by plural neural networks to produce an output or decision, rather than relying on a single neural network.

Summary

An aspect relates to an apparatus comprising: means for providing a plurality of neural network instances from a baseline neural network; means for causing connection dropout in each neural network instance, determined over a plurality of dropout iterations, to provide modified versions of the neural network instances; and means for causing training of each modified neural network instance using a respective training data set over a plurality of training epochs.

The apparatus may further comprise means for providing the respective training data sets by means of dividing provided training data into diversified clusters by reducing or minimising correlation between the clusters.

The means for providing the respective training data sets may be configured to encode the provided training data with reconstruction loss, and to cluster the encoded training data. The means for providing the respective training data sets may divide the provided training data into diversified clusters using k-means clustering.

The means for causing connection dropout may be configured, in a first dropout iteration, to cause different initial dropouts for each respective neural network instance, and to perform a plurality of subsequent dropout iterations to reduce the correlation of dropouts between different neural network instances.

Each of the plurality of subsequent dropout iterations may be performed by comparing a subset of the neural network instances, assigning a penalty based on the similarity of their respective dropouts, updating the penalty based on a proposed change to the dropouts of at least one of said subset, and keeping the proposed change if the proposed change indicates a decrease in similarity.

Each of the plurality of subsequent dropout iterations may be performed by comparing pairwise combinations of the neural network instances.

The plurality of subsequent dropout iterations may be performed until a predetermined diversity condition is met. The predetermined diversity criterion may be met when all possible neural network combinations have been compared. The predetermined diversity criterion may be met when the penalty indicates a maximum diversity or minimum correlation.

The apparatus may further comprise means for configuring one or more processing devices with the modified neural network instances, the training means causing training of each modified neural network instance on the one or more processing devices using the respective training data sets over the plurality of training epochs.

The configuring and training may be performed after the predetermined diversity condition is met.

The apparatus may be a controller which is separate from the one or more processing devices on which the modified neural networks are configured and trained.

The apparatus may further comprise means for applying test data to each trained neural network instance to produce respective output data sets, and means for receiving the respective output data sets and providing a single output data set. The means for causing connection dropout may be configured to cause each neural network instance to perform its own respective dropout process, independent of the other neural network instances, wherein for each neural network instance, after a first dropout iteration, the training means may cause training of each modified neural network instance for at least one training epoch, whereafter one or more further dropout iterations and respective training epochs are performed for each updated neural network instance.

The apparatus may further comprise means for receiving the trained neural network instances and combining their trained parameters to produce a first generalised neural network.

The apparatus may further comprise means for providing the first generalised neural network as a new baseline neural network, for providing further neural network instances from the new baseline neural network, for use in one or more subsequent dropout iterations and training epochs, and wherein the receiving and combining means further produces a second generalised neural network therefrom.

The apparatus may further comprise means for applying test data to each of the plurality of generalised neural networks to produce respective output data sets, and means for receiving the respective output data sets and providing a single output data set.

Each neural network instance may be configured on one or more processing devices and wherein the receiving and combining means is a controller, separate from the one or more processing devices.

Another aspect provides a method comprising: providing a plurality of neural network instances from a baseline neural network; causing connection dropout in each neural network instance, determined over a plurality of dropout iterations, to provide modified versions of the neural network instances; and causing training of each modified neural network instance using a respective training data set over a plurality of training epochs.

The method may further comprise providing the respective training data sets by means of dividing provided training data into diversified clusters by reducing or minimising correlation between the clusters.

Providing the respective training data sets may comprise encoding the provided training data with reconstruction loss, and clustering the encoded training data. Providing the respective training data sets may comprise dividing the provided training data into diversified clusters using k-means clustering.

Causing connection dropout may comprise, in a first dropout iteration, causing different initial dropouts for each respective neural network instance, and causing performance of a plurality of subsequent dropout iterations to reduce the correlation of dropouts between different neural network instances.

Each of the plurality of subsequent dropout iterations may be performed by comparing a subset of the neural network instances, assigning a penalty based on the similarity of their respective dropouts, updating the penalty based on a proposed change to the dropouts of at least one of said subset, and keeping the proposed change if the proposed change indicates a decrease in similarity. Each of the plurality of subsequent dropout iterations may be performed by comparing pairwise combinations of the neural network instances.

The plurality of subsequent dropout iterations may be performed until a predetermined diversity condition is met.

The predetermined diversity criterion may be met when all possible neural network combinations have been compared.

The predetermined diversity criterion may be when the penalty indicates a maximum diversity or minimum correlation.

The method may further comprise configuring one or more processing devices with the modified neural network instances, and causing training of each modified neural network instance on the one or more processing devices using the respective training data sets over the plurality of training epochs.

The configuring and training may be performed after the predetermined diversity condition is met. The method may be performed at a controller which is separate from the one or more processing devices on which the modified neural networks are configured and trained. The method may further comprise applying test data to each trained neural network instance to produce respective output data sets, and receiving the respective output data sets and providing a single output data set.

Causing connection dropout may comprise causing each neural network instance to perform its own respective dropout process, independent of the other neural network instances, wherein for each neural network instance, after a first dropout iteration, the training means causes training of each modified neural network instance for at least one training epoch, whereafter one or more further dropout iterations and respective training epochs may be performed for each updated neural network instance.

The method may further comprise receiving the trained neural network instances and combining their trained parameters to produce a first generalised neural network.

The method may further comprise providing the first generalised neural network as a new baseline neural network, providing further neural network instances from the new baseline neural network, for use in one or more subsequent dropout iterations and training epochs, and wherein the receiving and combining may further produce a second generalised neural network therefrom.

The method may further comprise applying test data to each of the plurality of generalised neural networks to produce respective output data sets, receiving the respective output data sets and providing a single output data set.

Each neural network instance may be configured on one or more processing devices and wherein the receiving and combining may be performed by a controller, separate from the one or more processing devices.

Another aspect discloses a computer program comprising instructions for causing an apparatus to perform at least the following: providing a plurality of neural network instances from a baseline neural network; causing connection dropout in each neural network instance, determined over a plurality of dropout iterations, to provide modified versions of the neural network instances; and causing training of each modified neural network instance using a respective training data set over a plurality of training epochs. Optional features of the computer program may comprise any previous method feature. Another aspect provides a non-transitory computer-readable storage medium having stored thereon computer-readable code, which, when executed by at least one processor, causes the at least one processor to perform a method, comprising: providing a plurality of neural network instances from a baseline neural network; causing connection dropout in each neural network instance, determined over a plurality of dropout iterations, to provide modified versions of the neural network instances; and causing training of each modified neural network instance using a respective training data set over a plurality of training epochs.

Another aspect provides an apparatus, the apparatus having at least one processor and at least one memoiy having computer-readable code stored thereon which when executed controls the at least one processor: to provide a plurality of neural network instances from a baseline neural network; to cause connection dropout in each neural network instance, determined over a plurality of dropout iterations, to provide modified versions of the neural network instances; and to cause training of each modified neural network instance using a respective training data set over a plurality of training epochs.

Reference to“means” herein may refer to an apparatus, processor, controller or similar hardware for performing the stated operation or operations. The processor, controller or similar hardware may have at least one associated memory having computer-readable code stored thereon which when executed controls the processor, controller or similar hardware to perform the stated operation or operations. The stated operation or operations may also or alternatively be performed using firmware or one or more electrical or electronic circuits.

Brief Description of the Drawings

Example embodiments will now be described with reference to the accompanying drawings, in which:

FIG. l is a graphical representation of an example neural network architecture;

FIG. 2 is a computer system comprising a plurality of networked devices and a control apparatus according to some example embodiments;

FIG. 3 is a schematic diagram of components of the FIG. 2 control apparatus according to some example embodiments;

FIG. 4 is a flow diagram showing example operations performed at the FIG.2 control apparatus for diversifying training data, according to some example embodiments; FIG. 5 is a flow diagram showing example operations performed at the FIG.2 control apparatus for diversifying neural network architectures, according to some example embodiments;

FIG. 6 is a flow diagram showing other example operations performed at the FIG.2 control apparatus for diversifying neural network architectures, according to some example embodiments;

FIG. 7 is a flow diagram showing other example operations performed at one of the FIG. 2 devices for diversifying neural network architectures, according to some example

embodiments;

FIG. 8 is a schematic diagram which is useful for understanding the FIGS. 6 and 7 operations; and

FIG. 9 is a flow diagram showing example operations performed at a plurality of the FIG. 2 devices for diversifying and training neural network architectures according to some example embodiments;

Detailed Description

In the description and drawings, like reference numerals refer to like elements throughout.

An artificial neural network (“neural network”) is a computer system inspired by the biological neural networks in human brains. A neural network may be considered a particular kind of computational graph, model or architecture used in machine learning. A neural network may comprise a plurality of discrete processing elements called“artificial neurons” which may be connected to one another in various ways. The strengths or weights of the connections may be updated, or“trained”, with the aim of optimising the neural network’s performance on a task in question. The artificial neurons may be organised into layers, typically an input layer, one or more intermediate or hidden layers, and an output layer. In a conventional neural network, the output from one layer becomes the input to the next layer, and so on, until the output is produced by the final layer. However, neural networks are not limited to this setup. In some neural networks, the output from several preceding layers are provided to a subsequent layer as an input. In some neural networks, the output of a given and/or subsequent layers are fed back as an input to a given layer.

For example, in image processing, the input layer and one or more intermediate layers close to the input layer may extract semantically low-level features, such as edges and textures. Later intermediate layers may extract higher-level features. There may be one or more intermediate layers, or a final layer, that performs a certain task on the extracted high-level features, such as classification, semantic segmentation, object detection, de-noising, style transferring, super-resolution processing and so on.

Neural networks may be termed“shallow” or“deep” which generally reflects the number of layers. A shallow neural network may in theory comprise only an input layer, an output layer. A deep neural network may comprise many hidden layers, possibly running into hundreds or thousands, depending on the complexity of the neural network.

It follows that the amount of processing and storage required for somewhat complex tasks may be very high.

Artificial neurons are sometimes referred to as“nodes” or“units”. Nodes perform processing operations, often non-linear. The strengths or weights of the connections between nodes are typically represented by numerical data and indicate the value that a signal passing through a given connection is multiplied by. There may be one or more other inputs called bias inputs.

There are a number of different architectures of neural network, some of which will be briefly mentioned here.

The term architecture (alternatively topology) refers to characteristics of the neural network, for example how many layers it comprises, the number of nodes in a layer, how the artificial neurons are connected within or between layers and may also refer to characteristics of weights and biases applied, such as how many weights or biases there are, whether they use integer precision, floating point precision etc. It defines at least part of the structure of the neural network. Learned characteristics such as the actual values of weights or biases may not form part of the architecture.

The architecture or topology may also refer to characteristics of a particular layer of the neural network, for example one or more of its type (e.g. input, intermediate or output layer, convolutional), the number of nodes in the layer, , the processing operations to be performed by each node etc. Examples of processing operations performed by nodes include evaluating activation functions (rectified linear unit or one of its variants, liner, sigmoid, hyperbolic tangent, etc.) on the input provided to the given node.

For example, a feedforward neural network (FFNN) is one where connections between nodes do not form a cycle, unlike recurrent neural networks. The feedforward neural network is perhaps the simplest type of neural network in that data or information moves in one direction, forwards from the input node or nodes, through hidden layer nodes (if any) to the one or more output nodes. There are no cycles or loops. Feedforward neural networks may be used in applications such as computer vision and speech recognition, and generally to classification applications.

For example, a convolutional neural network (CNN) differs from a conventional feedforward neural network in that convolution operations may take place to help correlate features of the input data across space and time, making such networks useful for applications such as image recognition, object detection, handwriting and speech recognition.

For example, a recurrent neural network (RNN) is an architecture that maintains some kind of state or memory from one input to the next, making it well-suited to sequential forms of data such as text, speech and video. In other words, the output for a given input depends not just on the current input but also on previous inputs.

Example embodiments to be described herein may be applied to any form of neural network or learning model, for any application or task, although examples are focussed on

feedforward neural networks.

When the architecture of a neural network is initialised, the neural network may operate in two phases, namely a training phase and an inference phase.

Initialised, initialisation or implementing, refers to setting up of at least part of the neural network architecture on one or more devices, and may comprise providing initialisation data to the devices prior to commencement of the training and/or inference phases. This may comprise reserving memory and/or processing resources at the particular device for the one or more layers, and may for example allocate resources for individual nodes, store data representing weights, and storing data representing other characteristics, such as where the output data from one layer is to be provided after execution. Initialisation may be

incorporated as part of the training phase in some embodiments. Some aspect of the initialisation may be performed autonomously at one or more devices in some embodiments.

In the training phase, the values of the weights in the network are determined. Initially, random weights may be selected or, alternatively, the weights may take values from a previously-trained neural network as the initial values. Training may involve supervised or unsupervised learning. Supervised learning involves providing both input and desired output data, and the neural network then processes the inputs, compares the resulting outputs against the desired outputs, and propagates the resulting errors back through the neural network causing the weights to be adjusted with a view to minimising the errors iteratively. When an appropriate set of weights are determined, the neural network is considered trained. Unsupervised, or adaptive training, involves providing input data but not output data. It is for the neural network itself to adapt the weights according to one or more algorithms. However, described embodiments are not limited by the specific training approach or algorithm used.

Once trained, the inference phase uses the trained neural network, with the weights determined during the training stage, to perform a task and generate output. For example, a task may be classification of an input image into one or more categories of images. Another task may consist of filling in missing part in an image.

In some example embodiments, a neural network architecture may be implemented by means of using one, or multiple devices interconnected by means of a communications network. If multiple devices are used, the resources required for potentially complex neural network architectures can be distributed across plural devices, utilising their respective processing and memory resources, rather than employing a single computer system which may require significant processing and memory resources. These may be termed training devices. In some example embodiments, a control apparatus, e.g. a server, may perform some functions in addition to the training devices.

A device as defined herein is any physical apparatus having its own processing and storage capability. The different devices may be co-located or physically separate. One or more of the devices may be remote. The communication network may be any form of data

communication network, for example a local area network (LAN), a wide area network (WAN), a data bus, or a peer-to-peer (P2P) network. A combination of the above network forms may be used. The data communications network may be established using one or both of wireless or wired channels or media, and may use protocols such as Ethernet, Bluetooth or WiFi, which are given merely by way of example.

Example embodiments herein relate to preparing or configuring neural networks for training, and in some cases also comprise training and applying test data for inference, i.e. to perform one or more tasks based on the trained neural networks.

Neural network ensembling is the process of combining multiple neural networks or models to produce a result, rather than using just one neural network or model. There may be different types of combination methods, such as majority voting, average, weighted average where weights may be confidence estimates, median, etc. Frequently, an ensemble of neural networks produces a more reliable result. This generally requires diversity among the neural networks. However, increasing the number of models increases computational complexity and the memoiy resources required. Dropout, or connection dropout, is a process that may be used to simulate or approximate neural network ensembling. Dropout is usually used at training stage to obtain a better generalization performance of the neural network, but it may also be used at inference stage for example in order to obtain an estimate of the uncertainty of the neural network. Dropout involves taking a neural network architecture and randomly removing one or more connections for a single training iteration. This may be achieved by multiplying a weight or connection value by zero, or a relatively small number. Therefore, dropout is sometimes described with regard to removing weights rather than connections.

For the purposes of the following, the two are considered equivalent. Another random set of weights may then be removed and another training epoch performed. At the end of each training epoch, the weights are shared. That is, the updated weights not removed at the beginning of the training epoch have the same value at the beginning of the next training epoch, assuming they are not removed in the next training epoch. Dropout approximates to training a plurality of neural networks with shared weights at the same time in a single training iteration. In inference or testing, activations of each weight that had a probability of being removed during training are multiplied with the probability of not being removed. This approximates to averaging many models’ activations.

An epoch, or training epoch, refers to the passing of all training data (for the particular neural network) through the neural network. This may involve one forwards and backwards pass is some embodiments, and it may happen in multiple iterations where at each iteration a subset of batch of training samples is passed through the neural network.

There is a low probability that random dropout will result in two identical neural network architectures. Hence a dropout process typically involves training each neural network instance for one training iteration, before performing the next dropout, and so on. As the weights are shared after each training iteration, dropout only explores minor variations of a given neural network architecture. Furthermore, diversity is constrained to subgraphs of the neural network architecture only. Dropout is also known to converge much slower, and hence it is not as computationally efficient as training a neural network without dropout.

Embodiments here provide improvements in terms of diversity and computational efficiency. Embodiments provide an apparatus and method which comprises providing a plurality of neural network instances from a baseline neural network, causing connection dropout in each neural network instance, determined over a plurality of dropout iterations, to provide modified versions of the neural network instances, and causing training of each modified neural network instance using a respective training data set over a plurality of training epochs.

Different example embodiments are described. The embodiments may, but do not necessarily, employ a central entity such as a controller apparatus or server to ensure diversity among different neural network instances.

A neural network instance is a data representation of a baseline (or reference) neural network architecture. In some embodiments, a neural network instance of a fixed architecture may simply comprise a list of those parameters that may vary, such as whether or not connections between a fixed number of nodes are present and/or their respective weights.

Embodiments generally aim to provide diversified neural network architectures and/or diversified training data, with the aim of offering improvements over existing apparatuses and methods for ensembling and/or dropout.

FIG. l is an example neural network architecture to, comprising a plurality of nodes n, each for performing a respective processing operation. The neural network architecture to comprises an input layer 12, one or more intermediate layers 14 and an output layer 16. The interconnections 17 between nodes 11 of different layers may have associated weights. The weights may be set in an implementation or initialization phase, varied during the training phase of the neural network, and may remain set during the inference phase of the neural network until a new update of the weights is performed.

FIG. 2 is an example computer system 20 on which the FIG. 1 neural network architecture 10 may be implemented. The computer system 20 comprises a network 21, a neural network control apparatus 22 and first to fifth devices 23 - 27. The network 21 may be any form of data network, for example a local area network (LAN), a wide area network (WAN), the Internet, a peer-to-peer (P2P) network, and may use wired or wireless communications as mentioned previously.

The neural network control apparatus 22 and the first to fifth devices 23 - 27 may be distinct, physically separate computer devices having their own processing and memoiy resources. In some example embodiments, the neural network control apparatus 22 may be provided as part of one of the first to fifth devices 23 - 27. In embodiments herein, it is assumed that the neural network control apparatus 22 is separate from the first to fifth devices 23 - 27.

The neural network control apparatus 22 and the first to fifth devices 23 - 27 may be any form of computer device, for example one or more of a personal computer (PC), laptop, tablet computer, smartphone, router and an Internet-of-Things (IoT) device. The processing and memory capabilities of the first to fifth devices 23 - 27 may be different, and may change during operation, for example during use for some other purpose.

In some embodiments, one or more of the first to fifth devices 23 - 27 may be portable devices. In some embodiments, one or more of the first to fifth devices 23 - 27 may be batteiy powered or self-powered by one or more energy harvesting sources, such as for examples a solar panel or a kinetic energy converter. One or more of the first to fifth devices 23 - 27 may perform the operations to be described below in parallel with other processing tasks, for example when running other applications, handling telephone calls, retrieving and displaying browser data, or performing an Internet of Things (IoT) operation.

The neural network control apparatus 22 may be considered a centralised computer apparatus that causes assignment of neural network layers to one or more networked devices, for example the networked devices 23 - 27 shown in FIG. 2.

FIG. 3 is a schematic diagram of components of the neural network control apparatus 22. However, it should be appreciated that the same or similar components may be provided in each of the first to fifth devices 23 - 27.

The neural network control apparatus 22 may have a controller 30, a memory 31 closely coupled to the controller and comprised of a RAM 33 and ROM 34 and a network interface 32. It may additionally, but not necessarily, comprise a display and hardware keys. The controller 30 may be connected to each of the other components to control operation thereof. The term memory may refer to a storage space.

The network interface 32 may be configured for connection to the network 21, e.g. a modem which may be wired or wireless. An antenna (not shown) may be provided for wireless connection, which may use WiFi, 3GPP NB-IOT, and/or Bluetooth, for example.

The memory 31 may comprise a hard disk drive (HDD) or a solid state drive (SSD). The ROM 34 of the memory 31 stores, amongst other things, an operating system 35 and may store one or more software applications 36. The RAM 33 is used by the controller 30 for the temporary storage of data. The operating system 35 may contain code which, when executed by the controller 30 in conjunction with the RAM 33, controls operation of each of the hardware components.

The controller 30 may take any suitable form. For instance, it may be a microcontroller, plural microcontrollers, a processor, plural processors, or processor circuitiy.

In some example embodiments, the neural network control apparatus 22 may also be associated with external software applications. These may be applications stored on a remote server device and may run partly or exclusively on the remote server device. These applications may be termed cloud-hosted applications or data. The neural network control apparatus 22 may be in communication with the remote server device in order to utilize the software application stored there.

Example embodiments herein will now be described in greater detail. The processing operations to be described below may be performed by the one or more software applications 36 provided on the memoiy 30, or on hardware or a combination thereof.

Diversification of Training Data

The control apparatus 22 may, in some embodiments, perform diversification of training data. The training data may comprise any form or amount of training data used in conventional neural network training.

Prior to training, the control apparatus 22 may divide provided training data into diversified clusters in order to reduce or minimise similarity or correlation between the clusters. This may involve training an auto-encoder on the provided training data with reconstruction loss or any other suitable loss, and clustering the encoded training data. An auto-encoder consists of an encoder and a decoder, so the encoded training data is the output of the encoder part. This may additionally involve using k-means clustering as the clustering method. However, any other features derived from the training and any other clustering method may be used.

For example, FIG. 4 illustrates an example process. A set of provided training data 40 may be provided to an unsupervised encoder which is part of an auto encoder 41, which may produce training data representations 42 which would result into low reconstruction loss, where the reconstruction is performed by the decoder. A decoder 43 may recover the encoded data. A clustering operation 44 may perform clustering of the training data representations 42 to a certain number K of clusters, to produce K training data sets 45 - 47. Each training data set 45 - 47 is diversified according to an algorithm for subsequent provision to M respective neural network instances 48. The clustering operation 44 may use any suitable algorithm, for example a k-means clustering algorithm. The clustering algorithm used in the clustering operation 44 may be such that the partitioned K training data sets 45 - 47 are least correlated.

It may also be that the number of clusters K produced by the clustering operation 44 is high, possibly much higher than the number of M training devices available. Then, the distribution of K clusters to M training devices may be performed such that the deviation of both intra- device clusters and inter-device clusters is high. For example, let be the centroid of cluster i. Let D_k be a set of clusters to be used in device m and D contains all training devices’ clusters. Then, the inter-device deviation measure may be given as follows.

Similarly, intra-class deviation can be designed as follows

Reinforcement learning may also be used for this optimization process, whereby a global penalty may represent a negative reward to be maximized. Also, evolutionaiy approaches may be used for this optimization, using the global penalty as the cost to be minimized by using evolutionary operations.

Diversification of Model Architectures

A number of example embodiments will now be described for diversifying neural network architectures to provide a plurality of neural network instances 48. The diversified training data sets 45 - 47 may be applied to respective neural network instances 48.

In one example embodiment, diversification may be provided using rules. The control apparatus 22 may generate a number of different neural network instances 48 each having a different architecture. For example, the control apparatus 22 may be configured to provide different neural network instances 48 with different numbers of layers and/or filter sizes and/or number of nodes per layer and/or number of activations, and so on. One can define a rule-based penalty for redundant models. One example is as follows.

First, a global penalty P is defined. Then, for each pair of neural network instances 48 having same number of layers, add a penalty Pi. Then, for each pair of neural network instances 48 having same size of filters in a layer N, add a penalty P₂(iV). Then, for each pair of devices having same number of filters in layer N add a penalty

Then, for each pair of devices having same activations at layer N add a penalty P₄( ). Then, find a model architecture distribution that minimizes the global penalty P, e.g. where P = P_t + ¾(&' 0 +¾{ /) + P₄(N).

It will be appreciated that only a subset of the above rules may be used, or alternative rules used, or additional rules used to calculate P.

The operation of finding a model architecture distribution that minimises the global penalty may be achieved in a number of ways. For example, a relatively straightforward way is for the control apparatus 22 to perform a grid search across many neural network instances 48.

Another way is to perform the following. For each pair of neural network instances 48 that contribute to a penalty P_E, propose one or more changes (i.e. for Pi change the number of layers in one neural network instance). Then, if this proposal results in another penalty, determine if this penalty is smaller than the current one. If this is satisfied, make the proposed change. Repeat this for all penalties that contribute to the global penalty over a plurality of iterations until the general penalty is zero, below a threshold or no more changes are possible, i.e. all possible combinations have been exhausted.

This procedure is partially illustrated in the flow diagram of FIG. 5, which indicates processing operations that may be performed on the control apparatus 22. One operation 5.1 may comprise taking pairs of neural network instances with a list of respective penalties for each pairwise combination. An operation 5.2 takes each pairwise combination and, in an operation 5.3, proposes a modification 5.3 to the pairwise combination, for example removing one or more of a connection or weight or activation or increasing a filter size. This may be done to one of the neural network pairs, or by proposing different modifications to both. An operation 5.4 may comprise identifying any new penalties. If none result then continue with the proposed modification in operation 5.7. If there are new penalties, an operation 5.5 determines if the new penalties are worse than the previous penalties for the pairwise combination. If not, the proposed modification is performed on the one or more pairs of neural network instances in an operation 5.7 and the new penalties are added to the list whereafter operation 5.1 is returned to until the global penalty is either zero, or a predetermined lower threshold reached. If the penalties get worse, then in operation 5.6 the proposed modification to the neural network instances is not made and the next pair is evaluated by returning to operation 5.2.

It will be appreciated that some operations may be removed or replaced. Some operations may be performed in parallel. Additional operations may be added. Numbering of operations is not necessarily indicative of processing order.

The penalties P. per pairwise case i may be defined manually. There may also be some user- defined constraints on the possible modification or modifications that are performed in operation 5.3. For example, a possible constraint is that the filter size in a first layer of neural network instances cannot exceed 7x7 for computational efficiency, etc. Alternatively, or additionally, reinforcement learning can be used for this optimization process, where the global penalty P represents a negative reward to be maximized. Also, evolutionaiy

approaches may be used for this optimization, using the global penalty P as the cost to be minimized by using evolutionaiy operations.

During inference, the control apparatus 22 may send the data to be tested to each neural network instance 48 and gather the results (e.g. the predictions). Then, by means of any combination strategy (such as using known techniques e.g. majority voting for classification or averaging for regression) the final result can be determined.

Alternatively, or additionally, a device providing one or more neural network instances 48 may collect all other neural network instances from other devices, and may perform the inference itself on the multiple neural network instances. The combination of results may be performed in the same or similar manner as above.

Thus, the described embodiment permits ensembling using diverse training data and diverse architectures, which is a goal of neural network ensembling.

It should be appreciated that each above neural network instance 48 can be configured on a plurality of different training devices, e.g. in a distributed manner as indicated in FIG. 2.

Another embodiment will now be described. This embodiment may also receive clustered training data sets. For example, this embodiment may involve providing training data sets derived using the previously described, or a similar, method.

This embodiment may use a substantially fixed baseline neural network architecture, for example having the same number of activations, number of layers, filter sizes and so on. Different neural network instances may be produced from the baseline neural network by modifying only the connections using a connection dropout process. Dropout of connections may be achieved by dropping weights, for example by multiplying a current weight value by zero or a small number. Diversification of model architectures may be achieved in a relatively straightforward manner.

The connection dropout may be performed by the control apparatus 22. The control apparatus 22 may store a data representation of the baseline neural network architecture and thereafter create data representations of plural neural network instances, each having the same architecture, and iteratively performing dropout to derive modified neural network instances which are diversified. The control apparatus 22 may then configure the actual neural network instances, for example on one or more training devices which may be external to the control apparatus 22 and distributed in the manner shown in FIG. 2. In some embodiments, the plural neural network instances created by the control apparatus 22 may simply comprise the list of connections or weights for each instance, because the baseline architecture is otherwise fixed.

Referring to FIG. 6, a flow diagram is shown comprising example processing operations that may be performed by the control apparatus 22 in an example embodiment. One operation 6.1 may comprise providing plural neural network instances from a baseline neural network. Another operation 6.2 may comprise causing connection dropout in each neural network instance to provide modified versions of each neural network instance. Another operation 6.3 may comprise causing training of each modified neural network instance over a plurality of training epochs. These operations relate to the configuring and training phases performed by the control apparatus 22. In some embodiments, the control apparatus 22 may perform additional operations, including an operation 6.4 of performing inference by applying test data to each trained neural network instance. Another operation 6.5 may comprise combining the inference results from the neural network instances to provide the ensemble result. It will be appreciated that some operations may be removed or replaced. Some operations may be performed in parallel. Additional operations may be added. Numbering of operations is not necessarily indicative of processing order.

Referring to FIG. 7, a flow diagram is shown comprising example processing operations that may be performed by the control apparatus 22 in another example embodiment. One operation 7.1 may comprise providing plural neural network instances from a baseline neural network. Another operation 7.2 may comprise causing connection dropout in each neural network instance to provide modified versions of each neural network instance. Another operation 7.3 may determine if a diversity criterion has been met. If not, operation 7.2 is re- performed. If so, another operation 7.4 may comprise configuring the modified neural network instances on one or more processing devices, i.e. training devices. Another operation 7.5 may comprise causing training of each modified neural network instance over a plurality of training epochs. These operations relate to the configuring and training phases performed by the control apparatus 22. In some embodiments, the control apparatus 22 may perform additional operations, including an operation 7.6 of performing inference by applying test data to each trained neural network instance. Another operation 7.7 may comprise combining the inference results from the neural network instances to provide the ensemble result.

The operation 7.2 may comprise, for example, initialising each neural network instance with random connection dropouts. A global penalty P may be defined, which may be a function of one or more penalty factors Pi, P2 etc. based on proposed connection dropouts. For example, the global penalty P may equal P1 + P2 + P3 etc.

In an example, the global penalty P is determined based on pairwise comparisons. A first penalty Pi may penalise pairs of devices per each pair of the same connections proposed for dropping. For example, if a first neural network instance and a second neural network instance both have weights A, B and C proposed for dropping, then the penalty Pi may be two or three times a constant. For example, a P2 may penalise each pair of devices per each pair of connections removed from the same filter. For example, a P3 may penalise each pair of devices per each pair of connections removed from the same layer. Additional constraints may apply, such as the maximum number of connections that can be dropped from a filter or layer, which can be fixed beforehand.

A process similar to the FIG. 5 operations may be performed over plural dropout iterations, whereby pairwise comparison of different neural network instances is made, and for each pair, a connection dropout for one or both pairs is proposed. If the resulting global penalty P changes, to indicate less correlation, then the proposal may be made, and fed-back for the next connection dropout iteration once the other pairs have been processed. If the resulting global penalty P indicates the same, or more correlation, then the proposal is not performed.

This process is iteratively performed until either the global penalty is zero, meets a predetermined lower value, or all possible pairwise combinations or proposed changes have been completed.

The result of the process will be a plurality of neural network instances, diversified in terms of their different connections.

FIG. 8 is a graphical diagram useful for understanding the above embodiment, using a relatively simple baseline neural network 8o. From this, first and second neural network instances 8i, 82 are initialised by randomly dropping connections or weights in each instance. Each neural network instance 81, 82 then undergoes the pairwise process of proposing one or more further dropouts, determining if they become more or less diverse, and keeping the proposed dropouts if they become more diverse. This process is repeated until the diversity criterion is met. From this, first and second modified neural network instances 84, 85 are provided, in this example to first and second training devices 86, 87. Finally, the first and second modified neural network instances are trained over a plurality of epochs using respective training data clusters 88, 89.

During inference operations 6.4, 7.6, the control apparatus 22 may send the test data to each neural network instance 84, 85 on the respective devices 86, 87 and may receive the results (predictions) from each. Then, using one or more known combination strategies, such as majority voting for classification, or averaging for regression, the ensembling operations 6.5, 7.7 can decide on the final result.

Alternatively, one training device 86 may collect the neural network instance(s) 85 from the other training device(s) 87 models and execute the inference and ensembling itself on the plurality of neural network instances 84, 85. The combination may be the same as above.

Another option is to combine all neural network instances 84, 85 into a single one. This is possible because the neural network instances 84, 85 are subsets of the baseline neural network 80. This can, for example, be performed by averaging the weights across all instances 86, 87. Neural network instances 86, 87 that do not include a certain weight can be left out for averaging that weight. This option is mainly for fast inference.

This embodiment is similar to conventional dropout only in terms of the initial random removal of connections or weights for each neural network instance. However, iterative diversification of the otherwise fixed architecture to meet a diversity criterion (or criteria) over multiple iterations ensures that the instances are not too correlated. In training, a key difference in the above embodiment is that each neural network instance is trained over a relatively long time, for example for one to few hundreds epochs, but it could be more or less, whereas in conventional dropout many models are only trained for one epoch or even for only one iteration. Another key difference is that, in conventional dropout, the values of weights are shared. This makes the neural network instances that are used somewhat correlated. In the embodiments proposed herein weights are not shared. Another key difference is that, in the conventional inference operations, dropout approximates averaging of all activations in the general model across all models and it makes a single decision. In our embodiments, we either average the weights (not the activations) to make a single decision or we run each model separately and average/combine the predictions.

Another embodiment will now be described.

This embodiment may also receive clustered training data sets. For example, this

embodiment may involve providing training data sets derived using the previously described, or a similar, method.

This embodiment may also use a substantially fixed, baseline neural network architecture, for example having the same number of activations, number of layers, filter sizes and so on. Different neural network instances may be produced from the baseline neural network by modifying only the connections using a connection dropout process. Dropout of connections may be achieved by dropping weights, for example by multiplying a current weight value by zero or a small number.

In this embodiment, a conventional dropout process may be performed for each neural network instance. Each connection dropout process is performed independently of the other connection dropout processes, and therefore dropout may not be determined by the control apparatus 22 and there is no sharing of weights between the N instances for one or more training epochs. For example, this may be performed by providing each neural network instance on a respective training device, for example as shown in FIG. 2, each having its own processing and memory resources. Each training device may initialise its own neural network instance with an initial randomised dropout of one or more connections to provide a modified neural network instance. Then, a respective set of the clustered training data may be used to train each modified neural network instance for one epoch. The weights at the end of the training epoch may be shared within a neural network instance, that is they may be kept for the next dropout iteration (provided those connections are not dropped) as in the conventional dropout process. The training and dropout process may repeat within the independent neural network instances for a plurality of iterations/epochs, for example in the order of one-hundred.

This helps ensure that the modified neural network instances are less correlated, because of the independence of connection dropouts, and also better trained because the training happens over more than one epoch, and possibly for tens or about a hundred epochs.

Referring to FIG. 9, for example, a flow diagram shows example processing operations performed by both the control apparatus 22 and one or more training devices, e.g. first and second training devices 23, 24 shown in FIG. 2. It should be appreciated that all operations may be performed on a single device in other embodiments. It should also be appreciated that more than two training devices 23, 24 may be used.

A first operation 9.1 may comprise the control apparatus 22 providing to the respective training devices 23, 24 an instance of a baseline neural network, which are received in operation 9.2. Another operation 9.3, independently performed at each respective training device 23, 24, is to perform a dropout using a respective randomised connection dropout algorithm. In another operation 9.4, the control apparatus 22 provides to each training device 23, 24 a respective training data cluster. Each training device 23, 24 receives its training data and commences training for one epoch in an operation 9.5. In an operation 9.6 the trained weights are shared and operations 9.4 and 9.5 repeated for a predetermined number N of epochs, e.g. one hundred.

After N epochs, the trained neural network instances are sent by the training devices 23, 24 to the control apparatus 22 which combines in an operation 9.7 the trained data to generate a generalised neural network model. The combining may comprise searching by a linear combination of weights that gives the highest performance on a validation dataset. This may be used for inference, but this may be fed back to operation 9.1 so that the described process is repeated one or more additional times to further update the generalised neural network model. An inference operation 9.8 may be performed subsequently on the generalised model in the conventional manner, as the generalised model generalised model approximates to an ensemble of different neural networks, but with better diversification and therefore providing more accurate results during inference.

A difference of the embodiment compared with conventional dropout is that, in dropout there is weight sharing in each epoch, yet in the above case the weights are initialized to the same value once, and then, for a plurality of epochs, the weights are not shared across the neural network instances or training devices. Only after a plurality of epochs are weights shared and the training may repeat with the new neural network model. This may be important because the neural network instances have the freedom to search for parameters independent from each other. The combining of the neural network instances is also significantly different than conventional dropout.

Embodiments herein describe the following features.

First, there has been described a methodology applied in a central entity (e.g. the neural network control apparatus 22) which ensures diversity among the model architectures, and of the training data across individual training devices. An algorithm has be described, being an example of how to ensure diversity.

Further, to address a drawback of dropout, there has been described the training of distributed several neural network instances with removed weights over a relatively long period, ensuring that each neural network instance is reliable. This addresses the drawback of conventional dropout, where several neural network instances are rarely trained more than once due to the low probability of coming up with a specific neural network architecture more than once.

Further, to address another drawback of dropout, namely correlated model architectures, there has been described a method and system for controlling the randomness of removing connections. This is controlled by ensuring model diversity, prior to learning, by the central entity or neural network control apparatus 22. Examples of algorithms for satisfying a diversity criterion have been described.

Further, to address another drawback of dropout, namely correlated weight values, we do not employ weight sharing in all cases. In one embodiment, weight values are not shared at all and in another case the weight values are shared (in fact combined) after a relatively long time of individual training, i.e. after a relatively large number of training epochs.

Where a structural feature has been described, it may be replaced by means for performing one or more of the functions of the structural feature whether that function or those functions are explicitly or implicitly described.

The term apparatus may be replaced with the term device. The term neural network may be replaced with trained or learned model. The methods and apparatuses described may be implemented in hardware, software, firmware or a combination thereof.

In this brief description, reference has been made to various examples. The description of features or functions in relation to an example indicates that those features or functions are present in that example. The use of the term‘example’ or‘for example’ or‘may’ in the text denotes, whether explicitly stated or not, that such features or functions are present in at least the described example, whether described as an example or not, and that they can be, but are not necessarily, present in some of or all other examples. Thus‘example’,‘for example’ or‘may’ refers to a particular instance in a class of examples. A property of the instance can be a property of only that instance or a property of the class or a property of a sub-class of the class that includes some but not all of the instances in the class. It is therefore implicitly disclosed that a features described with reference to one example but not with reference to another example, can where possible be used in that other example but does not necessarily have to be used in that other example.

Although embodiments of the present invention have been described in the preceding paragraphs with reference to various examples, it should be appreciated that modifications to the examples given can be made without departing from the scope of the invention as claimed.

Features described in the preceding description may be used in combinations other than the combinations explicitly described.

Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not.

Although features have been described with reference to certain embodiments, those features may also be present in other embodiments whether described or not. Whilst endeavoring in the foregoing specification to draw attention to those features of the invention believed to be of particular importance it should be understood that the Applicant claims protection in respect of any patentable feature or combination of features hereinbefore referred to and/or shown in the drawings whether or not particular emphasis has been placed thereon.

Claims

1. An apparatus comprising:

means for providing a plurality of neural network instances from a baseline neural network;

means for causing connection dropout in each neural network instance, determined over a plurality of dropout iterations, to provide modified versions of the neural network instances;

means for causing training of each modified neural network instance using a respective training data set over a plurality of training epochs.

2. The apparatus of claim 1, further comprising means for providing the respective training data sets by means of dividing provided training data into diversified clusters by reducing or minimising correlation between the clusters.

3. The apparatus of claim 2, wherein the means for providing the respective training data sets is configured to encode the provided training data with reconstruction loss, and to cluster the encoded training data.

4. The apparatus of claim 2 or claim 3, wherein the means for providing the respective training data sets divides the provided training data into diversified clusters using k-means clustering.

5. The apparatus of any preceding claim, wherein the means for causing connection dropout is configured, in a first dropout iteration, to cause different initial dropouts for each respective neural network instance, and to perform a plurality of subsequent dropout iterations to reduce the correlation of dropouts between different neural network instances.

6. The apparatus of claim 5, wherein each of the plurality of subsequent dropout iterations is performed by comparing a subset of the neural network instances, assigning a penalty based on the similarity of their respective dropouts, updating the penalty based on a proposed change to the dropouts of at least one of said subset, and keeping the proposed change if the proposed change indicates a decrease in similarity.

7. The apparatus of claim 6, wherein each of the plurality of subsequent dropout iterations is performed by comparing pairwise combinations of the neural network instances.

8. The apparatus of any of claims 5 to 7, wherein the plurality of subsequent dropout iterations are performed until a predetermined diversity condition is met.

9. The apparatus of claim 8, wherein the predetermined diversity criterion is met when all possible neural network combinations have been compared.

10. The apparatus of claim 8, wherein the predetermined diversity criterion is when the penalty indicates a maximum diversity or minimum correlation.

11. The apparatus of any of claims 1 to 10, further comprising means for configuring one or more processing devices with the modified neural network instances, the training means causing training of each modified neural network instance on the one or more processing devices using the respective training data sets over the plurality of training epochs.

12. The apparatus of claim n, when dependent on claim 8, wherein the configuring and training is performed after the predetermined diversity condition is met.

13. The apparatus of claim 11 or claim 12, performed at a controller which is separate from the one or more processing devices on which the modified neural networks are configured and trained.

14. The apparatus of any of preceding claim, further comprising means for applying test data to each trained neural network instance to produce respective output data sets, and means for receiving the respective output data sets and providing a single output data set.

15. The apparatus of any of claims 1 to 4, wherein the means for causing connection dropout is configured to cause each neural network instance to perform its own respective dropout process, independent of the other neural network instances, wherein for each neural network instance, after a first dropout iteration, the training means causes training of each modified neural network instance for at least one training epoch, whereafter one or more further dropout iterations and respective training epochs are performed for each updated neural network instance.

16. The apparatus of claim 14, further comprising means for receiving the trained neural network instances and combining their trained parameters to produce a first generalised neural network.

17. The apparatus of claim 15, further comprising means for providing the first generalised neural network as a new baseline neural network, for providing further neural network instances from the new baseline neural network, for use in one or more subsequent dropout iterations and training epochs, and wherein the receiving and combining means further produces a second generalised neural network therefrom.

18. The apparatus of claim 17, further comprising means for applying test data to each of the plurality of generalised neural networks to produce respective output data sets, and means for receiving the respective output data sets and providing a single output data set.

19. The apparatus of any of claims 14 to 18, wherein each neural network instance is configured on one or more processing devices and wherein the receiving and combining means is a controller, separate from the one or more processing devices.

20. A method comprising:

providing a plurality of neural network instances from a baseline neural network; causing connection dropout in each neural network instance, determined over a plurality of dropout iterations, to provide modified versions of the neural network instances;

causing training of each modified neural network instance using a respective training data set over a plurality of training epochs.

21. The method of claim 20, further comprising providing the respective training data sets by means of dividing provided training data into diversified clusters by reducing or minimising correlation between the clusters.

22. The method of claim 21, wherein providing the respective training data sets comprises encoding the provided training data with reconstruction loss, and clustering the encoded training data.

23. The method of claim 21 or claim 22, wherein providing the respective training data sets comprises dividing the provided training data into diversified clusters using k-means clustering.

24. The method of any of claims 20 to 23, wherein causing connection dropout comprises, in a first dropout iteration, causing different initial dropouts for each respective neural network instance, and causing performance of a plurality of subsequent dropout iterations to reduce the correlation of dropouts between different neural network instances.

25. The method of claim 24, wherein each of the plurality of subsequent dropout iterations is performed by comparing a subset of the neural network instances, assigning a penalty based on the similarity of their respective dropouts, updating the penalty based on a proposed change to the dropouts of at least one of said subset, and keeping the proposed change if the proposed change indicates a decrease in similarity.

26. The method of claim 25, wherein each of the plurality of subsequent dropout iterations is performed by comparing pairwise combinations of the neural network instances.

27. The method of any of claims 24 to 26, wherein the plurality of subsequent dropout iterations are performed until a predetermined diversity condition is met.

28. The method of claim 27, wherein the predetermined diversity criterion is met when all possible neural network combinations have been compared.

29. The method of claim 27, wherein the predetermined diversity criterion is when the penalty indicates a maximum diversity or minimum correlation.

30. The method of any of claims 20 to 29, further comprising configuring one or more processing devices with the modified neural network instances, and causing training of each modified neural network instance on the one or more processing devices using the respective training data sets over the plurality of training epochs.

31. The method of claim 30, when dependent on any of claims 27 to 29, wherein the configuring and training is performed after the predetermined diversity condition is met.

32. The method of claim 30 or claim 31, performed at a controller which is separate from the one or more processing devices on which the modified neural networks are configured and trained.

33. The method of any of claims 20 to 32, further comprising applying test data to each trained neural network instance to produce respective output data sets, and receiving the respective output data sets and providing a single output data set.

34. The method of any of claims 20 to 33, wherein causing connection dropout comprises causing each neural network instance to perform its own respective dropout process, independent of the other neural network instances, wherein for each neural network instance, after a first dropout iteration, the training means causes training of each modified neural network instance for at least one training epoch, whereafter one or more further dropout iterations and respective training epochs are performed for each updated neural network instance.

35. The method of claim 34, further comprising receiving the trained neural network instances and combining their trained parameters to produce a first generalised neural network.

36. The method of claim 35, further comprising providing the first generalised neural network as a new baseline neural network, providing further neural network instances from the new baseline neural network, for use in one or more subsequent dropout iterations and training epochs, and wherein the receiving and combining further produces a second generalised neural network therefrom.

37. The method of claim 36, further comprising applying test data to each of the plurality of generalised neural networks to produce respective output data sets, receiving the respective output data sets and providing a single output data set.

38. The method of any of claims 35 to 37, wherein each neural network instance is configured on one or more processing devices and wherein the receiving and combining is performed by a controller, separate from the one or more processing devices.

39. A computer program comprising instructions for causing an apparatus to perform at least the following: