WO2020095321A2 - Dynamic structure neural machine for solving prediction problems with uses in machine learning - Google Patents

Dynamic structure neural machine for solving prediction problems with uses in machine learning Download PDF

Info

Publication number
WO2020095321A2
WO2020095321A2 (PCT/IN2019/050820)
Authority
WO
WIPO (PCT)
Prior art keywords
data
neurons
hidden layer
hidden
data set
Prior art date
Application number
PCT/IN2019/050820
Other languages
French (fr)
Other versions
WO2020095321A8 (en)
WO2020095321A3 (en)
Inventor
Vishwajeet Singh Thakur
Original Assignee
Vishwajeet Singh Thakur
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vishwajeet Singh Thakur filed Critical Vishwajeet Singh Thakur
Publication of WO2020095321A2 publication Critical patent/WO2020095321A2/en
Publication of WO2020095321A3 publication Critical patent/WO2020095321A3/en
Publication of WO2020095321A8 publication Critical patent/WO2020095321A8/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

This invention discloses a new and novel methodology which can be used to solve multiclass classification problems in an automated way. It describes a novel neural network architecture "Dynamic Structure Neural Network (DSNN)", a novel automated learning method "Dynamic Structure Neural Learning (DSNL)" for training DSNN models, and a product "Dynamic Structure Neural Machine (DSNM)" which is a computer implementation of DSNN and DSNL for solving multiclass classification problems, such as Medical Diagnosis, Face Recognition, Sentiment Analysis, Speech Recognition, etc. The system and method given in this invention analyzes any (structured, semi-structured or unstructured) type and form of data that can be vectorized. The novelty of this method is the architecture of the DSNN model and the automated learning method DSNL that simultaneously determines the number of hidden layers, the number of processing units (or neurons) in each hidden layer and their parameters (weights and biases).

Description

COMPLETE SPECIFICATION
Dynamic Structure Neural Machine for Solving Prediction Problems with uses in Machine Learning
The following specification particularly describes the invention and the manner in which it is to be performed
FIELD OF INVENTION: -
This invention relates to the field of artificial intelligence. More particularly, this invention relates to neural networks in the field of information technology.
Particularly, the present invention discloses a new and novel neural network architecture, the “Dynamic Structure Neural Network (DSNN)”, for solving the multiclass classification problem. This invention also discloses a new and novel automated method of training (or learning) Dynamic Structure Neural Networks, “Dynamic Structure Neural Learning (DSNL)”, for solving the multiclass classification problem, and describes a product, the “Dynamic Structure Neural Machine (DSNM)”, which is implementable in hardware for specific problems.
BACKGROUND OF INVENTION :-
Real-world applications involve solving problems where an input data point has to be classified as belonging to one of a pre-defined finite number of classes; such a problem is referred to as a classification problem in the general computer science community. Examples are e-mail classification, face recognition, cancer prediction, etc. There are also real-world applications where a real-valued output has to be predicted for each input data point; these are called regression problems. Some examples of regression problems are stock price prediction, credit rating in banking or insurance, market demand forecasting, etc. Machine learning techniques are commonly applied to solve real-world classification and regression problems. Neural networks are a class of machine learning models that have been successfully applied to many classification and regression problems. Training a neural network model requires data (referred to as training data) and a parameter estimation technique (also known as a training or learning method). One of the early methods to train (multi-layered) neural networks is the back-propagation method, described in D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning representations by back-propagating errors, Nature, 323, 533-536, 1986. Most of the common (supervised learning) methods proposed thus far are some variation or extension of this work. Although widely popular, the back-propagation method relies on the user/developer to guess an appropriate neural network architecture (i.e., the number of layers and the size of each layer), for which the user relies on trial and error. Also, back-propagation based learning can give a locally optimum solution, as its computation is based on gradients of some error function.
The theory of neural networks states that a feed-forward neural network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of R^n, under mild assumptions on the activation function. This is referred to as the Universal Approximation Theorem, and one of its first versions is proved in G. Cybenko, Approximations by superpositions of sigmoidal functions, Mathematics of Control, Signals, and Systems, 2(4), 303-314, 1989. In practice, it is found that neural networks with many hidden layers, also referred to as deep neural networks, tend to perform better on tasks involving large datasets. Training deep neural networks by using trial and error to guess the size and shape of the network is an even more challenging task, as the number of hyperparameters over which trial and error is done is much larger.
It is an object of the present invention to provide a new and novel neural network architecture, the “Dynamic Structure Neural Network (DSNN)”, for solving the multiclass classification problem. Further, it is an object of the invention to provide a new and novel automated learning (training) methodology, “Dynamic Structure Neural Learning (DSNL)”, to train (or learn) DSNN models for solving the multiclass classification problem.
Further, it is an object of the invention to provide a new and novel Stochastic Adaptive Partitioning algorithm which is used to construct the hidden layer of a feed-forward neural network, given the previous layer's output on the entire data set.
Furthermore, it is an object of the invention to provide a product, the “Dynamic Structure Neural Machine (DSNM)”, which is implementable in hardware for specific problems.
DESCRIPTION OF INVENTION -
Neural networks are a class of machine learning models that can be trained to solve classification, regression and other tasks using (training) data. Traditional methods of training neural networks for any task require the size (or architecture) of the neural network to be specified by the user, and that size is fixed during the training / learning process. This is required because the majority of learning methods are based on some variant of the back-propagation technique. The back-propagation method (and its variants) determines the parameter values of all the layers simultaneously. All the parameter values are updated in an iterative manner using steepest-descent logic until a convergence condition is satisfied. Hence the size of the neural network has to be decided by the user before the training / learning process starts and stays fixed thereafter.
This invention is about:
(a) Dynamic Structure Neural Network (DSNN), a new and novel neural network architecture to solve the multiclass classification problem
(b) Dynamic Structure Neural Learning (DSNL), a new and novel automated learning method to train DSNN to solve multiclass classification problems without the user having to specify any learning hyperparameters, including the size of the neural network
(c) Dynamic Structure Neural Machine (DSNM), a computer implementation of DSNN and DSNL to solve real-world machine learning tasks of multiclass classification
Discussion on the terminology used:
I. (Feed-forward) Neural networks consist of multiple layers arranged in a sequential order, with each layer comprising multiple processing units (or neurons).
II. The first (hidden) layer gets the training data as its input, and the output of each layer is then fed as input to the subsequent layer. The output of the final layer is considered the output of the neural network for a given input.
III. Each processing unit i of layer l accepts as input a vector x^l and performs an affine transformation z_i^l = w_i^l * x^l + b_i^l, where w_i^l is a vector of the same dimension as the input x^l and b_i^l is a one-dimensional variable (these will be referred to as the weight vector and bias term, respectively, in the subsequent discussion).
IV. This affine transformation is usually followed by a non-linear transformation y_i^l = f(z_i^l). The functions commonly used as the non-linear transformation are Sigmoid or Tanh, but in practice many other functions are also applied to curtail the output of a neuron to a range.
V. Here, y_i^l, the output of the non-linear transformation, is considered the output of neuron i of layer l.
VI. The weight vectors (w_i^l) and bias terms (b_i^l) of all neurons across all layers form the parameters of the neural network, whose values are determined during the training / learning stage.
VII. In vector geometry parlance, a hyperplane is represented as w * x + b = 0, where w is a vector normal to the hyperplane and b is the intercept term. Any vector x_k which lies on the hyperplane satisfies the equation w * x_k + b = 0.
VIII. We say that a point (vector) x_k lies on the positive side of the hyperplane if w * x_k + b > 0 and on the negative side of the hyperplane if w * x_k + b < 0.
IX. The absolute value of the affine transformation z_k = w * x_k + b, where w is a unit vector, is interpreted as the perpendicular distance of the vector x_k from the hyperplane w * x + b = 0, and the sign of z_k determines on which side of the hyperplane the point x_k lies (a short numeric sketch of this appears after this list).
X. Therefore, the parameters of a neuron (weight vector and bias term) represent a hyperplane, where weight vector of the neuron is normal vector to the hyperplane and bias term of neuron is the intercept term of the hyperplane
XI. The absolute value of the affine transformation that the neuron performs is directly proportional to the perpendicular distance between the input to the neuron and the neuron's hyperplane.
XII. The sign of the output of affine transformation of a neuron signifies if the input is lying on the positive or negative side of the hyperplane of the neuron.
XIII. In the context of multiclass classification problems, output layer comprises the same number of neurons as the number of classes.
XIV. Training neural network models usually involves splitting the input data set into three subsets, called the train, validation and test sets, in the ratio of roughly 80% train, 10% validation and 10% test. The train data set is used to estimate the neural network parameters. The validation data set is used to decide when to stop the training process; it is also used to pick the hyper-parameter values. The test data set is used to estimate the generalization performance of the trained neural network model.
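As a concrete illustration of items VII to XII above, the following short sketch (with hypothetical numeric values, and assuming the numpy library) treats a single neuron's parameters as a hyperplane and computes on which side of the hyperplane an input point lies and its perpendicular distance from it:

    import numpy as np

    # Hypothetical neuron parameters: unit-norm weight vector and bias (intercept) term
    w = np.array([0.6, 0.8])      # ||w|| = 1
    b = -1.0
    x = np.array([2.0, 1.0])      # an input point

    z = np.dot(w, x) + b          # affine transformation performed by the neuron
    side = "positive" if z > 0 else "negative"
    distance = abs(z)             # since ||w|| = 1, |z| is the perpendicular distance
    print(f"z = {z:.2f}; the point lies on the {side} side at distance {distance:.2f}")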
Dynamic Structure Neural Network (DSNN):
DSNN is a new and novel architecture of feedforward neural networks for solving the multiclass classification problem. This neural network architecture is derived from and based on the geometric significance of the role of neurons in hidden layers in achieving the mapping from the input of the neural network to its output.
In the context of multiclass classification, each hidden layer neuron plays a role in separating the training data set into homogenous subsets (i.e. all points in the subset belong to the same class). The hidden layers transform the training data set in the input space, which is typically not linearly separable, into a space where the set of points becomes linearly separable (i.e. a single hyperplane separates the majority of points of one class from the rest of the points).
With this understanding the DSNN architecture for classification problems is proposed as:
- A neural network comprising an input layer, one or more hidden layers and an output layer
- Neurons in each hidden layer are grouped, based on their geometric orientation w.r.t. the input to the hidden layer, as either Frontier neurons or Inner neurons.
- Frontier neurons of each hidden layer are connected to the next hidden layer and the output layer.
- Therefore the output layer receives, as input, the output of Frontier neurons of all the hidden layers.
- Inner neurons of each hidden layer are only connected to the next (hidden or output) layer.
- The definition and determination of Frontier neurons and Inner neurons is explained in the following section which describes the automated training method for DSNN
FIG 1 shows the schematic of the DSNN architecture for multiclass classification problem.
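The following is a minimal forward-pass sketch of the architecture described above, not the patented implementation: the per-layer weight matrices W, bias vectors b and boolean Frontier masks are assumed inputs, and the output layer is fed the Frontier outputs of every hidden layer (plus, as described in the DSNL section below, the full output of the last hidden layer):

    import numpy as np

    def dsnn_forward(x, hidden_layers, W_out, b_out):
        """Sketch of a DSNN forward pass; hidden_layers is a list of dicts with
        keys 'W', 'b' (all neurons of the layer) and 'frontier' (boolean mask)."""
        layer_in = x
        collected = []
        for idx, layer in enumerate(hidden_layers):
            h = np.tanh(layer['W'] @ layer_in + layer['b'])   # affine + non-linearity
            is_last = (idx == len(hidden_layers) - 1)
            keep = np.ones(h.shape, dtype=bool) if is_last else layer['frontier']
            collected.append(h[keep])          # Frontier outputs go to the output layer
            layer_in = h                       # the next layer consumes the full output
        out_in = np.concatenate(collected)     # input to the output layer
        logits = W_out @ out_in + b_out
        return np.exp(logits) / np.exp(logits).sum()          # softmax over the classes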
Dynamic Structure Neural Learning (DSNL):
DSNL is a novel method of automated learning (training) of DSNN model to solve multiclass classification problem.
DSNL constructs each hidden layer, one after the other, from a corresponding SAP-Tree (Stochastic Adaptive Partitioning Tree) data structure. Nodes of the SAP-Tree correspond to hyperplanes. This tree is built with the aim of partitioning the input data set to the hidden layer into smaller subsets which are homogenous (i.e. all points in the subset belong to the same class).
Each node of the SAP-Tree is converted to either a Frontier or an Inner neuron of the hidden layer. A hyperplane which divides a (sub)set of input data points into two subsets, at least one of which is homogenous, is converted to a Frontier neuron. The remaining hyperplanes are converted into Inner neurons.
Figures 2(A), 2(B) and 2(C) explain the iterative partitioning process of the same input data set to the hidden layer into disjoint homogenous subsets, by hyperplanes which represent either Frontier neurons or Inner neurons, thereby resulting in the SAP-Tree data structure.
Specifically, FIG 2(A) shows the schematic of the process of creation of Frontier and Inner neuron hyperplanes. FIG 2(B) shows the schematic of the process of iteratively partitioning the data set into disjoint homogenous subsets. FIG 2(C) shows the schematic of SAP-Tree formation by iterative partitioning of the data set using hyperplanes. After adding a hidden layer, the algorithm decides whether it is required to add yet another hidden layer to the model. The criterion for this is: if the total number of neurons in all the hidden layers is greater than x% of the training data set size, then do not add any new hidden layer. In a general automated learning setting, a good estimate for the value of x, derived from experimental results, is 5%. However, for specific data sets the value of x can be tuned further.
The next hidden layer is constructed from only those data points which do not belong to homogenous subsets of the previous and current hidden layers' SAP-Tree(s).
Finally, the parameters of the output layer are determined (without error back-propagation). For a multiclass classification problem, the size of the output layer is the same as the number of classes. The cost function is the softmax cross-entropy function. The input to the output layer is the output of the Frontier neurons of all the hidden layers and the Inner neurons of the last hidden layer. The parameters of this output layer can be learnt directly using the gradient descent method applied on the cost function and the input data set to the output layer.
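As a sketch of this step only, and under the assumption that the inputs to the output layer have already been collected into a matrix H (one row per training point), the output layer can be fitted by plain gradient descent on the softmax cross-entropy loss; the learning rate and epoch count below are placeholder values:

    import numpy as np

    def train_output_layer(H, y, num_classes, lr=0.1, epochs=200):
        """Gradient descent on the softmax cross-entropy loss for the output layer only.
        H: (m, d) inputs to the output layer; y: integer class labels in [0, num_classes)."""
        m, d = H.shape
        W = np.zeros((num_classes, d))
        b = np.zeros(num_classes)
        Y = np.eye(num_classes)[y]                  # one-hot targets, shape (m, C)
        for _ in range(epochs):
            logits = H @ W.T + b                    # (m, C)
            logits -= logits.max(axis=1, keepdims=True)
            P = np.exp(logits)
            P /= P.sum(axis=1, keepdims=True)       # softmax probabilities
            G = (P - Y) / m                         # gradient of the loss w.r.t. the logits
            W -= lr * (G.T @ H)
            b -= lr * G.sum(axis=0)
        return W, b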
By this stage the dimensions of the neural network are fully determined and fixed, hence a few iterations of the back-propagation algorithm can be run to fine-tune the parameters (weights and biases) of the DSNN model. This is an optional step and usually results in a slight improvement in the accuracy of the DSNN model.
Algorithm: DSNL(D)
D = input training data set
1. Diter = D
2. N = DSNN model
3. Build a SAP-Tree data structure for the current hidden layer using Diter as its input data set
4. Convert the SAP-Tree data structure into hidden layer H with Frontier and Inner Neurons, and add to N
5. Decide if another hidden layer is required; if YES:
(a) D1 = Subset of Diter which does not belong to homogenous subsets of the SAP-Tree
(b) Diter = Current hidden layer's (H) output for subset D1
(c) Go to Step 3
6. Train the output layer using gradient descent on the entire data set D (without error back-propagation)
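A compact sketch of the DSNL loop above is given below; build_sap_tree, tree_to_layer, stop_adding_layers, in_homogeneous_subset, layer_output and train_output_layer_gd are hypothetical helper names standing in for Steps 3 to 6, not functions defined in this specification:

    def dsnl(D):
        """Sketch of DSNL; all helper functions are assumed placeholders."""
        N = []                                    # the DSNN model as a list of hidden layers
        D_iter = D
        while True:
            tree = build_sap_tree(D_iter)         # Step 3: SAP-Tree for the current layer
            N.append(tree_to_layer(tree))         # Step 4: Frontier / Inner neurons
            if stop_adding_layers(N, D):          # Step 5: e.g. total neurons > x% of |D|
                break
            D1 = [p for p in D_iter if not in_homogeneous_subset(tree, p)]
            D_iter = [layer_output(N[-1], p) for p in D1]   # Step 5(b): next layer's input
        output_layer = train_output_layer_gd(N, D)          # Step 6: gradient descent only
        return N, output_layer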
Stochastic Adaptive Partitioning algorithm:
The Stochastic Adaptive Partitioning algorithm is a method to automatically construct the SAP-Tree data structure of a hidden layer from the hidden layer's input data set.
The Stochastic Adaptive Partitioning algorithm is an iterative method which, in each iteration, considers a (sub)set of data points to be partitioned into two subsets by a hyperplane. Data points lying on the positive side of the hyperplane form one subset and the ones lying on the other side form the other subset. It starts the iteration with the hidden layer's input data set and iterates over the smaller subsets recursively.
Algorithm: Stochastic Adaptive Partitioning(D)
1. T = tree
2. DQueue = {D}, a priority queue data structure containing subsets of data points which are to be partitioned
3. Dcurr = Get next element from DQueue
STOP and return T if Dcurr is NULL
4. Compute the normal weight vector of the partitioning hyperplane
5. Compute the bias/intercept term of the partitioning hyperplane
6. Split Dcurr into Dpositive and Dnegative
7. Add the hyperplane computed in steps 4 and 5 to T, and Dpositive and Dnegative to the DQueue, if none of the following conditions are TRUE:
(a) if Dcurr is homogenous (i.e. all points in the subset belong to the same class)
(b) the Information Gain metric, evaluated on the training data set, increases if Dcurr is partitioned further AND the Information Gain metric, evaluated on the union of the training and validation data sets, does not increase if Dcurr is partitioned further
(c) the total number of neurons in all the hidden layers is greater than x% of the training data set size. In a general automated learning setting, a good estimate for the value of x, derived from experimental results, is 5%. However, for specific data sets the value of x can be tuned further.
8. Goto Step 3
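A sketch of this queue-driven loop is given below; compute_weight_vector, compute_bias and should_split are hypothetical placeholders for steps 4, 5 and 7, and each data point is assumed to be a (numpy vector, label) pair:

    from collections import deque

    def stochastic_adaptive_partitioning(D):
        """Sketch of SAP-Tree construction; nodes are kept as a flat list for brevity."""
        tree = []
        queue = deque([D])                        # subsets still to be partitioned
        while queue:
            D_curr = queue.popleft()
            w = compute_weight_vector(D_curr)     # step 4 (returns None if step iii stops)
            if w is None:
                continue
            b = compute_bias(D_curr, w)           # step 5: gini-index based bias term
            D_pos = [(x, y) for (x, y) in D_curr if x @ w + b > 0]
            D_neg = [(x, y) for (x, y) in D_curr if x @ w + b <= 0]
            if should_split(D_curr, D_pos, D_neg):    # step 7: no stopping condition holds
                tree.append((w, b))
                queue.extend([D_pos, D_neg])
        return tree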
Steps 4, 5 and 6 of the algorithm are described below in detail:
Let D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} be the input data to the hidden layer, where x_i ∈ R^d and y_i is the class label such that y_i ∈ {1, 2, ..., C} for all i.
(1) Compute the normal weight vector of the partitioning hyperplane
Input to this subsection is the training data (sub)set D, which contains points belonging to C classes.
(step i) Find the class k which has the most data points; call it the dominating class.
(step ii) Split the training data set D into two subsets:
D_primary = {x_i | x_i ∈ D and y_i = k}, the set of points in D belonging to the dominating class k
D_secondary = {x_i | x_i ∈ D and y_i ≠ k}, the set of points in D which do not belong to class k
Let n1 and n2 be the sizes of the sets D_primary and D_secondary respectively.
(step iii) If the data set D is highly imbalanced, i.e., the ratio n2 / (n1 + n2) is less than a threshold T, then STOP. T is a tuning parameter which can take values in the range [0, 1).
That is, if D contains either all points belonging to one class or a very large majority of points belonging to one class, then do nothing and return. For any particular application, the cross-validation data set is used to determine the ideal value of T.
In an automated learning setting, it is experimentally found that 0.975 is a good estimate of T.
(step iv) Formulate and solve an optimization problem to find the weight vector of the neuron. Formulate an optimization problem, parameterized by a unit vector w, to achieve the following three objectives:
(a) Maximize f(w), the average pairwise difference between the projections of points from D_primary and D_secondary onto w
(b) Minimize g(w), the average pairwise difference between the projections of points from D_primary onto w
(c) Minimize h(w), the average pairwise difference between the projections of points from D_secondary onto w
This optimization problem has multiple objectives, that is, maximization of f(w) and minimization of g(w) and h(w). It also has a constraint that w is a unit vector.
Formulate the following multi-objective optimization problem to achieve the multiple objectives of maximizing f(w) and simultaneously minimizing g(w) and h(w):
Maximize w^T A w
Subject to ||w|| = 1
where A = A1 - λ1 * A2 - λ2 * A3, and A1, A2 and A3 are the symmetric matrices for which w^T A1 w = f(w), w^T A2 w = g(w) and w^T A3 w = h(w).
Where λ1 > 0 and λ2 > 0 are tuning parameters, and cross-validation data is used to determine the ideal values of λ1 and λ2. In an automated setting, it is experimentally found that λ1 and λ2 start at a zero value and slowly increase from 0 to 1 as the network size grows.
The term (w^T A1 w) represents the average pairwise difference between the projections of pairs of points from D_primary and D_secondary respectively onto w. This quantity has to be maximized.
The term (w^T A2 w) represents the average pairwise difference between the projections of pairs of points from D_primary onto w. This quantity has to be minimized.
The term (w^T A3 w) represents the average pairwise difference between the projections of pairs of points from D_secondary onto w. This quantity has to be minimized.
The solution w* of this problem is the leading eigenvector of symmetric matrix A, i.e., the eigenvector corresponding to the largest eigenvalue.
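The exact summation formulas for A1, A2 and A3 appear only as images in the published text; the sketch below therefore assumes one natural construction consistent with the description, namely the average outer product of pairwise differences, so that w^T A w equals the average squared difference of projections, and then takes the leading eigenvector of A = A1 - λ1*A2 - λ2*A3:

    import numpy as np

    def pairwise_diff_scatter(X, Z):
        """Average outer product of pairwise differences between rows of X and rows of Z
        (assumed form; w^T A w is then the average squared projection difference)."""
        A = np.zeros((X.shape[1], X.shape[1]))
        for xi in X:
            for zj in Z:
                d = xi - zj
                A += np.outer(d, d)
        return A / (len(X) * len(Z))

    def exact_discriminant(D_primary, D_secondary, lam1=0.0, lam2=0.0):
        """Leading eigenvector of A = A1 - lam1*A2 - lam2*A3 (exact, non-randomized form)."""
        A1 = pairwise_diff_scatter(D_primary, D_secondary)
        A2 = pairwise_diff_scatter(D_primary, D_primary)
        A3 = pairwise_diff_scatter(D_secondary, D_secondary)
        A = A1 - lam1 * A2 - lam2 * A3
        eigvals, eigvecs = np.linalg.eigh(A)       # A is symmetric
        w_star = eigvecs[:, np.argmax(eigvals)]    # eigenvector of the largest eigenvalue
        return w_star / np.linalg.norm(w_star)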
Stochastic Approximation of the Optimization Problem:
In the above formulation, if the number of samples (n1 and n2) increases, then the amount of computation, which is proportional to (n1*n2 + n1*n1 + n2*n2), also increases. To make the amount of computation linearly proportional to the number of samples (especially in the case of large datasets), a randomized approximation of the optimization problem stated above can be formulated by approximating matrix A with a matrix A_approx, defined as:
A_approx = A1_approx - λ1 * A2_approx - λ2 * A3_approx
Where:
A1_approx is computed by a loop that iterates from k=1 to k=n, where n is the sum of n1 and n2; during every iteration of this loop the indices of elements of D_primary and D_secondary, i.e., i and j, are randomly selected from i ∈ [1, n1] and j ∈ [1, n2] respectively. The term (w^T A1_approx w) represents the random approximation of the average pairwise difference between the projections of pairs of points from D_primary and D_secondary respectively onto w.
A2_approx is computed by a loop that iterates from k=1 to k=n1; during every iteration of this loop the indices of elements of D_primary, i.e., i and j, are randomly selected from i ∈ [1, n1] and j ∈ [1, n1]. The term (w^T A2_approx w) represents the random approximation of the average pairwise difference between the projections of pairs of points from D_primary onto w.
A3_approx is computed by a loop that iterates from k=1 to k=n2; during every iteration of this loop the indices of elements of D_secondary, i.e., i and j, are randomly selected from i ∈ [1, n2] and j ∈ [1, n2]. The term (w^T A3_approx w) represents the random approximation of the average pairwise difference between the projections of pairs of points from D_secondary onto w.
The optimization problem is now transformed into the following form:
Maximize w^T A_approx w
Subject to ||w|| = 1
The solution w* of this problem is the leading eigenvector of the symmetric matrix A_approx, i.e., the eigenvector corresponding to the largest eigenvalue. We refer to this solution w* as the Stochastic Linear Discriminant.
The resulting vector w* becomes the weight vector of the next neuron of the hidden layer.
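Under the same assumed construction, the randomized approximation can be sketched as below, sampling n, n1 and n2 random pairs for A1_approx, A2_approx and A3_approx respectively; the precise sampling formulas in the published text are shown only as images, so this is an illustration rather than the definitive scheme:

    import numpy as np

    def approx_scatter(X, Z, num_pairs, rng):
        """Randomized approximation of the pairwise-difference scatter matrix:
        average outer product over num_pairs randomly chosen (row of X, row of Z) pairs."""
        A = np.zeros((X.shape[1], X.shape[1]))
        for _ in range(num_pairs):
            diff = X[rng.integers(len(X))] - Z[rng.integers(len(Z))]
            A += np.outer(diff, diff)
        return A / num_pairs

    def stochastic_linear_discriminant(D_primary, D_secondary, lam1=0.0, lam2=0.0, seed=0):
        """w* as the leading eigenvector of A_approx = A1_approx - lam1*A2_approx - lam2*A3_approx."""
        rng = np.random.default_rng(seed)
        n1, n2 = len(D_primary), len(D_secondary)
        A = (approx_scatter(D_primary, D_secondary, n1 + n2, rng)
             - lam1 * approx_scatter(D_primary, D_primary, n1, rng)
             - lam2 * approx_scatter(D_secondary, D_secondary, n2, rng))
        eigvals, eigvecs = np.linalg.eigh(A)
        w_star = eigvecs[:, np.argmax(eigvals)]
        return w_star / np.linalg.norm(w_star)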
(2) Compute the bias term of the partitioning hyperplane
The heuristic used to decide the value of the bias term is to choose a value which improves the overall classification accuracy using a minimal number of separating hyperplanes. One approach to implement this heuristic is to choose the bias term which results in the least misclassification error. There are other approaches defined in the field of Information Theory, namely Cross-Entropy/Information-Gain and Gini Index (or Gini Impurity). In practice, any of the following metrics can be used to choose the bias term value:
(a) Misclassification error
(b) Cross Entropy
(c) Gini Index
Below are the steps used to compute the bias term value using the Gini Index metric:
Input to this subsection is the training data (sub)set D and w* which is the weight vector of the neuron computed in the previous subsection.
(step i) Compute the set D_proj which contains the projections of points in D onto the vector w*:
D_proj = { w*^T x_i | x_i ∈ D }
(step ii) Use the gini index metric to pick a value that best splits the set D_proj. The gini index of a set D is defined as:
Gini(D) = 1 - Σ_k (p_k)^2
where p_k is the fraction of samples in D belonging to class k. The higher the value of the gini index, the more randomness in the data set D.
The gini index can be used to pick a value v ∈ D_proj which splits D into two subsets D_1 and D_2 such that
D_1 = {x_i ∈ D | w*^T x_i ≤ v} and D_2 = {x_i ∈ D | w*^T x_i > v}
The weighted gini index of the resulting subsets D_1 and D_2 is defined as:
Weighted_Gini(v) = (p_1 * Gini(D_1)) + (p_2 * Gini(D_2))
where p_1 and p_2 are the fractions of samples from set D that are present in sets D_1 and D_2 respectively, and Gini(D_1) and Gini(D_2) are the gini index values of D_1 and D_2 respectively.
How to pick the value of the bias term?
Among all values in D_proj, pick the value, v_bias, that results in the subsets D_1 and D_2 having the minimum weighted gini index.
(step iii) The bias term of the neuron, whose weight vector is the input w*, is set as the negative value of v_bias, i.e., (-1 * v_bias).
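The bias selection in steps i to iii can be sketched as a simple threshold scan over the projected values; here X is the matrix of points in D and y their class labels, and scanning every unique projection value is one assumed way of realising the heuristic:

    import numpy as np

    def gini(labels):
        """Gini index of a label array: 1 - sum_k p_k^2."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def choose_bias(X, y, w_star):
        """Pick v_bias minimising the weighted gini index of the induced split and
        return its negative as the neuron's bias term."""
        proj = X @ w_star                          # D_proj: projections onto w*
        best_v, best_score = proj[0], np.inf
        for v in np.unique(proj):
            left, right = y[proj <= v], y[proj > v]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_v, best_score = v, score
        return -best_v                             # bias term = -v_bias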
(3) Split Dcurr into two subsets Dpositive and Dnegative
The previous two subsections describe the steps to determine the parameters (weight vector and bias term) of one partitioning. This step describes the recursive approach towards building the SAP-Tree data structure.
Divide the training data (sub)set D into two subsets Dpositive and Dnegative based on whether the points in D lie on the positive or negative side of the hyperplane of the neuron whose weight vector and bias term are computed in subsections (1) and (2).
The Dynamic Structure Neural Machine (DSNM), which is an implementation of DSNN and DSNL, can be applied in a variety of applications, such as a Medical Diagnosis system, a Face Recognition system, Sentiment Analysis of Social Media content, a Speech Recognition system, etc. The hardware implementation allows it to receive a variety of input data, such as camera-based sensor data, data stored in an external storage medium which can be connected via a USB port, and stream (Web, Audio and Video) data. The vectorization module is capable of processing these types of data and converting them into vector form.
FIG 4 shows the hardware implementation of the Dynamic Structure Neural Machine (DSNM). The DSNM consists of an ARM processor based Central Processing Unit, one or more Sensors which feed data to the DSNM, a User-Input mechanism using which the user sends control signals to the DSNM, and a storage mechanism to store and retrieve data, models and other meta information. The central element of the DSNM is the DSNM process, which receives control signals from the user and performs data acquisition, model training or live testing. The DSNM operates in three modes: (i) data acquisition mode, (ii) training or learning mode and (iii) testing or live usage mode. In the data acquisition mode, the user directs the DSNM process to receive data from a sensor or input device(s) and to store it in the internal storage mechanism. In the training or learning mode, the user directs the DSNM to train a neural network model for a particular data set. The DSNM follows the training procedure described in the “DESCRIPTION” section to train a neural network using the training data set. Once the training is done, the trained model is stored using the internal storage mechanism. The testing or live usage mode allows the user to apply an already trained model on test or live data. In this case, the sensor or input device(s) are directed to receive the data, which is then converted into vector form and passed as input to the pre-trained model, which predicts the output value for that particular input. The output or predicted value is then transmitted to the output device(s).
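As an illustration only, the three operating modes can be pictured as a simple dispatch around the DSNM process; every object and helper used here (storage, sensors, output_device, vectorize, dsnl, predict) is a hypothetical placeholder rather than part of the specification:

    def dsnm_process(mode, storage, sensors, output_device):
        """Sketch of the DSNM control flow for its three modes."""
        if mode == "data_acquisition":
            raw = sensors.acquire()                   # read from sensor / input device(s)
            storage.save("dataset", vectorize(raw))   # store vectorized data internally
        elif mode == "training":
            D = storage.load("dataset")
            model = dsnl(D)                           # automated DSNL training
            storage.save("model", model)
        elif mode == "live_usage":
            model = storage.load("model")
            x = vectorize(sensors.acquire())
            output_device.emit(predict(model, x))     # send the prediction to output device(s)
        else:
            raise ValueError(f"unknown mode: {mode}")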
When applied to real-world multiclass classification data sets, DSNM has proven to give high accuracy across data sets from various domains. Examples of classification data sets on which DSNM was applied, with the neural network models trained in an automated manner, are:
(i) Handwritten digits recognition
This is a publicly available data set, called MNIST. It contains 60,000 training and 10,000 test images of ten handwritten digits, each of dimension 28x28. DSNM gave an accuracy of 98.4% on the test dataset.
(ii) Speech commands recognition
This is the Google Speech Commands data set, comprising audio recordings of 30 different types of speech commands. DSNM gave an accuracy of 96.9% on the test set.
(iii) Exotic Particles Search
This is the SUSY data set available on the UCI Machine Learning Repository. Here the problem is to distinguish between a signal process which produces supersymmetric particles and a background process which does not. It contains 4.5 million train data points and 0.5 million test data points. DSNM gave an accuracy of 80.1% on this data set.
(iv) Predictive Maintenance
This is the IDA-2016 challenge data set, present in the UCI Machine Learning repository. The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure System (APS), which generates pressurised air that is utilized in various functions in a truck, such as braking and gear changes. The dataset's positive class consists of component failures for a specific component of the APS system. The negative class consists of trucks with failures for components not related to the APS.
DSNM gave an accuracy of 99.2% on this data set.
(v) Bank Marketing data set
This data set is available on the UCI Machine Learning repository. The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y). DSNM gave an accuracy of 88.6% on this data set.

Claims

CLAIMS: What is claimed is:
1. A system and method with Artificial Intelligence (AI), Machine Learning comprising:
A computer-implementable Neural Network architecture, referred to as Dynamic Structure Neural Network (DSNN), to perform machine learning task of multiclass classification.
The neural network architecture comprises an input layer, one or more hidden layers and an output layer. Neurons in each hidden layer are grouped as either Frontier neurons or Inner neurons.
Frontier neurons of each hidden layer are connected to the next hidden layer and the output layer. Therefore the output layer receives, as input, the output of Frontier neurons of all the hidden layers
Inner neurons of each hidden layer are only connected to the next (hidden or output) layer.
2. The system and method of claim 1, further comprising:
A method of automated machine learning, referred to as Dynamic Structure Neural Learning (DSNL), which takes as input a multiclass classification data set (or training data) and outputs a DSNN model whose architecture is as mentioned in claim 1.
For a given multiclass classification data set, the DSNL automated learning process determines:
- number of hidden layers in the neural network model
- number of frontier and inner neurons in each hidden layer
- parameters (weights and biases) of all the neurons of all the hidden layers
- the size and parameters of the output layer
Each hidden layer is constructed, one after the other, from a corresponding SAP-Tree (Stochastic Adaptive Partitioning Tree) data structure. Nodes of the SAP-Tree correspond to hyperplanes. This tree is built with the aim of partitioning the input data set to the hidden layer into smaller subsets which are homogenous (i.e. all points in the subset belong to the same class).
Each node of the SAP-Tree is converted to either a Frontier or Inner neuron of the hidden layer. Hyperplane which results in dividing a (sub) set of input data points into two subsets at least one of which is homogenous, is converted to a Frontier neuron. Remaining hyperplanes are converted into Inner neurons.
After adding a hidden layer, the algorithm decides whether it is required to add yet another hidden layer to the model. The next hidden layer is constructed from only those data points which do not belong to homogenous subsets of the (previous and) current hidden layers' SAP-Tree(s).
Finally, the parameters of the output layer are determined using gradient descent method (without error back-propagation).
3. The system and method of claim 2, further comprising:
The Stochastic Adaptive Partitioning algorithm, which is a method to automatically construct the SAP-Tree data structure of a hidden layer from the hidden layer's input data set.
The Stochastic Adaptive Partitioning algorithm is an iterative method which, in each iteration, considers a (sub)set of data points to be partitioned into two subsets by a hyperplane. Data points lying on the positive side of the hyperplane form one subset and the ones lying on the other side form the other subset. It starts the iteration with the hidden layer's input data set and iterates over the smaller subsets recursively.
Partitioning a (sub)set containing data points of two or more classes is done adaptively by splitting these points into two groups, a primary group and a secondary group. The primary group comprises the data points belonging to the class having the maximum number of data points in the (sub)set to be partitioned in the current iteration. All the other data points of the (sub)set are part of the secondary group.
An optimization problem is proposed to find a unit vector on which the two groups of data points project in a way that creates maximum separation between the data points belonging to different groups and (to a certain extent) minimum separation between the points belonging to the same group. A stochastic approximation of this optimization problem is solved and the result is the normal vector of the partitioning hyperplane. The gini index metric is used to find the bias (or intercept term) of the partitioning hyperplane.
A subset which satisfies one of the following two criteria is not divided further:
(a) if the subset is homogenous (i.e. all points in the subset belong to the same class)
(b) if the Information Gain metric, evaluated on the training data set, increases when the (sub)set is partitioned further and the Information Gain metric, evaluated on the union of the training and validation data sets, does not increase when the (sub)set is partitioned further
4. The system and method of claim 1, further comprising:
The computer-implementation of neural network architecture and the automated learning method, referred to as Dynamic Structure Neural Machine (DSNM), which is applicable to real-world Machine Learning tasks, resulting in high accuracy while solving several particular problems such as (i) Handwritten digits recognition (MNIST data set), (ii) Speech commands recognition (Google Speech data set), (iii) Exotic Particles Search (SUSY data set) (iv) Predictive Maintenance (IDA-2016 data set) and (v) Bank Marketing data set; in comparison with similar automated machine learning methods.
PCT/IN2019/050820 2018-11-06 2019-11-05 Dynamic structure neural machine for solving prediction problems with uses in machine learning WO2020095321A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN201841041940 2018-11-06
IN201841041940 2018-11-06

Publications (3)

Publication Number Publication Date
WO2020095321A2 true WO2020095321A2 (en) 2020-05-14
WO2020095321A3 WO2020095321A3 (en) 2020-06-25
WO2020095321A8 WO2020095321A8 (en) 2020-07-23

Family

ID=70611488

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2019/050820 WO2020095321A2 (en) 2018-11-06 2019-11-05 Dynamic structure neural machine for solving prediction problems with uses in machine learning

Country Status (1)

Country Link
WO (1) WO2020095321A2 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112085157A (en) * 2020-07-20 2020-12-15 西安电子科技大学 Prediction model establishing method and device based on neural network and tree model
CN112116143A (en) * 2020-09-14 2020-12-22 贵州大学 Forest pest occurrence probability calculation processing method based on neural network
CN112697435A (en) * 2021-01-26 2021-04-23 山西三友和智慧信息技术股份有限公司 Rolling bearing fault diagnosis method based on improved SELD-TCN network
CN112908446A (en) * 2021-03-20 2021-06-04 张磊 Automatic mixing control method for liquid medicine in endocrinology department
CN113282842A (en) * 2021-01-25 2021-08-20 上海海事大学 Travel purpose identification method based on travel survey of smart phone and artificial neural network particle swarm optimization algorithm
CN113469339A (en) * 2021-06-30 2021-10-01 山东大学 Dimension reduction-based autopilot neural network robustness verification method and system
CN113590748A (en) * 2021-07-27 2021-11-02 中国科学院深圳先进技术研究院 Emotion classification continuous learning method based on iterative network combination and storage medium
CN114239330A (en) * 2021-11-01 2022-03-25 河海大学 Deep learning-based large-span latticed shell structure form creation method
CN115394394A (en) * 2022-10-27 2022-11-25 曹县人民医院 Resident health service reservation method and system based on big data processing technology
CN115534319A (en) * 2022-09-21 2022-12-30 成都航空职业技术学院 3D printing path planning method based on HGEFS algorithm
CN117537951A (en) * 2024-01-10 2024-02-09 西南交通大学 Method and device for detecting internal temperature rise of superconducting suspension based on deep learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7711663B2 (en) * 2006-03-27 2010-05-04 Board Of Trustees Of Michigan State University Multi-layer development network having in-place learning
US9753959B2 (en) * 2013-10-16 2017-09-05 University Of Tennessee Research Foundation Method and apparatus for constructing a neuroscience-inspired artificial neural network with visualization of neural pathways
US20180284735A1 (en) * 2016-05-09 2018-10-04 StrongForce IoT Portfolio 2016, LLC Methods and systems for industrial internet of things data collection in a network sensitive upstream oil and gas environment

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112085157A (en) * 2020-07-20 2020-12-15 西安电子科技大学 Prediction model establishing method and device based on neural network and tree model
CN112085157B (en) * 2020-07-20 2024-02-27 西安电子科技大学 Disease prediction method and device based on neural network and tree model
CN112116143A (en) * 2020-09-14 2020-12-22 贵州大学 Forest pest occurrence probability calculation processing method based on neural network
CN112116143B (en) * 2020-09-14 2023-06-13 贵州大学 Forest pest occurrence probability calculation processing method based on neural network
CN113282842A (en) * 2021-01-25 2021-08-20 上海海事大学 Travel purpose identification method based on travel survey of smart phone and artificial neural network particle swarm optimization algorithm
CN112697435A (en) * 2021-01-26 2021-04-23 山西三友和智慧信息技术股份有限公司 Rolling bearing fault diagnosis method based on improved SELD-TCN network
CN112908446A (en) * 2021-03-20 2021-06-04 张磊 Automatic mixing control method for liquid medicine in endocrinology department
CN112908446B (en) * 2021-03-20 2022-03-22 张磊 Automatic mixing control method for liquid medicine in endocrinology department
CN113469339A (en) * 2021-06-30 2021-10-01 山东大学 Dimension reduction-based autopilot neural network robustness verification method and system
CN113469339B (en) * 2021-06-30 2023-09-22 山东大学 Automatic driving neural network robustness verification method and system based on dimension reduction
CN113590748A (en) * 2021-07-27 2021-11-02 中国科学院深圳先进技术研究院 Emotion classification continuous learning method based on iterative network combination and storage medium
CN113590748B (en) * 2021-07-27 2024-03-26 中国科学院深圳先进技术研究院 Emotion classification continuous learning method based on iterative network combination and storage medium
CN114239330A (en) * 2021-11-01 2022-03-25 河海大学 Deep learning-based large-span latticed shell structure form creation method
CN114239330B (en) * 2021-11-01 2022-06-10 河海大学 Deep learning-based large-span latticed shell structure form creation method
CN115534319B (en) * 2022-09-21 2023-08-11 成都航空职业技术学院 3D printing path planning method based on HGEFS algorithm
CN115534319A (en) * 2022-09-21 2022-12-30 成都航空职业技术学院 3D printing path planning method based on HGEFS algorithm
CN115394394A (en) * 2022-10-27 2022-11-25 曹县人民医院 Resident health service reservation method and system based on big data processing technology
CN117537951A (en) * 2024-01-10 2024-02-09 西南交通大学 Method and device for detecting internal temperature rise of superconducting suspension based on deep learning
CN117537951B (en) * 2024-01-10 2024-03-26 西南交通大学 Method and device for detecting internal temperature rise of superconducting suspension based on deep learning

Also Published As

Publication number Publication date
WO2020095321A8 (en) 2020-07-23
WO2020095321A3 (en) 2020-06-25

Similar Documents

Publication Publication Date Title
WO2020095321A2 (en) Dynamic structure neural machine for solving prediction problems with uses in machine learning
Neill An overview of neural network compression
Narkhede et al. A review on weight initialization strategies for neural networks
Zhang et al. An unsupervised parameter learning model for RVFL neural network
Mercioni et al. The most used activation functions: Classic versus current
US11429862B2 (en) Dynamic adaptation of deep neural networks
EP3543917B1 (en) Dynamic adaptation of deep neural networks
US11803744B2 (en) Neural network learning apparatus for deep learning and method thereof
Springenberg et al. Improving deep neural networks with probabilistic maxout units
Li et al. Feature selection using a piecewise linear network
Li et al. Evolutionary extreme learning machine with sparse cost matrix for imbalanced learning
Suneera et al. Performance analysis of machine learning and deep learning models for text classification
Glauner Comparison of training methods for deep neural networks
Yeganejou et al. Improved deep fuzzy clustering for accurate and interpretable classifiers
Bao et al. Cross-entropy pruning for compressing convolutional neural networks
Kim Deep learning
Urgun et al. Composite power system reliability evaluation using importance sampling and convolutional neural networks
Ishii et al. Partially zero-shot domain adaptation from incomplete target data with missing classes
CN116524282A (en) Discrete similarity matching classification method based on feature vectors
Andrade et al. Implementation of Incremental Learning in Artificial Neural Networks.
Shetty et al. Comparative analysis of different classification techniques
Afrasiyabi et al. Energy saving additive neural network
Ye et al. Learning Algorithm in Two-Stage Selective Prediction
Via et al. Training algorithm for dendrite morphological neural network using k-medoids
Keddous et al. Characters Recognition based on CNN-RNN architecture and Metaheuristic

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19882268

Country of ref document: EP

Kind code of ref document: A2

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19882268

Country of ref document: EP

Kind code of ref document: A2

122 Ep: pct application non-entry in european phase

Ref document number: 19882268

Country of ref document: EP

Kind code of ref document: A2