WO2019155214A1

WO2019155214A1 - Method and apparatus for power signal disaggragation using a convolutional neural network

Info

Publication number: WO2019155214A1
Application number: PCT/GB2019/050332
Authority: WO
Inventors: Peter Davies; Lionel WARD; Jon DENNIS
Original assignee: Green Running Limited
Priority date: 2018-02-07
Filing date: 2019-02-07
Publication date: 2019-08-15
Also published as: GB2570890A; GB201801984D0; GB2570890B

Abstract

A power consumption signal processing system for monitoring power consumption, the system comprising: an input for receiving a power consumption signal, the power consumption signal having a first sampling frequency; an output for outputting information relating to changes in state of a plurality of devices and/or components of a device corresponding to changes in the signal; and a processor configured to: for a first portion of the input signal, extract a value corresponding to each of one or more characteristics of the signal for each of a plurality of windows, each window comprising a plurality of samples, resulting in a time series of data for each of the one or more characteristics comprising the window values, the time series having a data point frequency which is lower than the first sampling frequency, wherein at least one of the characteristics corresponds to at least one frequency component of the signal; generate an event probability value for each of one or more device or component state changes using a convolutional neural network based classifier, the event probability being the probability that the first portion of the signal corresponds to the change in state of the device or component, the classifier input being obtained from the time series.

Description

METHOD AND APPARATUS FOR POWER SIGNAL DISAGGRAGATION

USING A CONVOLUTIONAL NEURAL NETWORK

FIELD

The present invention relates to a method and apparatus for monitoring electrical power consumption.

BACKGROUND

An aggregate electrical power consumption signal may be measured for a unit, such as a residential or commercial building for example. An aggregate power consumption signal comprises the total power consumption data as a function of time. It is desirable to accurately determine the power consumption for the individual electrical power consuming appliances within the building, or components within an appliance or appliances, from the aggregate signal.

BRIEF DESCRIPTION OF FIGURES

Systems and methods in accordance with non-limiting embodiments will now be described with reference to the accompanying figures in which:

Figures 1 (a) and (b) show example power disaggregation data;

Figure 2 is a schematic illustration of a system for processing an electrical power consumption signal, in accordance with an embodiment;

Figure 3(a) shows a flow chart illustrating a method of processing a power consumption signal in accordance with an embodiment;

Figures 3(b) and 3(c) show example CNN models which may be used to implement the method of Figure 3(a);

Figure 4(a) shows an example method of extracting one or more time series corresponding to harmonic or fundamental freqnency magnitude values; Figure 4(b) shows an example method of extracting Q values;

Figure 5 shows an example method of extracting a time series corresponding to harmonic/fundamental frequency magnitude values and/or a time series corresponding to theta values;

Figure 6 shows an example of the time series for a first portion of the signal corresponding to an onset event of a bag-less vacuum cleaner;

Figure 7 shows a flow chart illustrating a method of processing a power consumption signal in accordance with an embodiment;

Figure 8 shows an example of an extracted first portion, which comprises a captured event caused by a clothes dryer switching on;

Figure 9 shows an example of buffered time series;

Figure 10 shows the time series of Figure 9 once normalisation has been applied;

Figure 11 shows an example method of training a CNN, illustrated using a training flowchart;

Figure 12 shows a plot of training accuracy and Figure 13 shows a plot of the loss;

Figure 14 shows an example evolution of the first 8 filters in the first convolutional layer at different stages during training of an example model;

Figure 15(a) shows an example of an event that is jittered and Figure 15(b) shows an example method of data augmentation that may be used when training the CNN;

Figure 16 shows a flow chart illustrating a method using a FHMM based tracking system; Figure 17 is a method of training the FHMM based tracking model, which is performed on the system before implementation.

DETAILED DESCRIPTION

According to an embodiment, there is provided a power consumption signal processing system for monitoring power consumption, the system comprising:

an input for receiving a power consumption signal, the power consumption signal having a first sampling frequency;

an output for outputting information relating to changes in state of a plurality of devices and/or components of a device corresponding to changes in the signal; and a processor configured to:

for a first portion of the input signal, extract a value corresponding to each of one or more characteristics of the signal for each of a plurality of windows, each window comprising a plurality of samples, resulting in a time series of data for each of the one or more characteristics comprising the window values, the time series having a data point frequency which is lower than the first sampling frequency, wherein at least one of the characteristics corresponds to at least one frequency component of the signal;

generate an event probability value for each of one or more device or component state changes using a convolutional neural network based classifier, the event probability being the probability that the first portion of the signal corresponds to the change in state of the device or component, the classifier input being obtained from the time series.

In an embodiment, the power consumption signal has a fundamental frequency, and the frequency component may correspond to one of the fundamental or harmonic frequencies.

In an embodiment, the classifier input is obtained from two or more time series.

In an embodiment, at least one of the characteristics relates to the amplitude of one or more harmonic or fundamental frequency components. In an embodiment, at least one of the characteristics comprises the amplitude of a harmonic or fundamental frequency component. In an embodiment, at least one of the characteristics comprises the root mean square of two or more even harmonic components.

In an embodiment, the power consumption signal comprises a current signal and a voltage signal and wherein at least one of the characteristics corresponding to at least one of the fundamental or harmonic frequencies comprises the phase angle of the current relative to the voltage for the frequency component.

In an embodiment, the characteristics of the utility consumption signal further comprise one or more of: real power, apparent power and reactive power.

In an embodiment, the power consumption signal comprises a current signal and a voltage signal and wherein the characteristics of the utility consumption signal comprise the real power, the root mean square of the magnitude of all harmonic frequency components, and the phase angle of the current relative to the voltage at the fundamental frequency. In an embodiment, the characteristics further comprise the reactive power.

The processor may be further configured to identify the first portion of the input signal by detecting a magnitude change and identifying the first portion of the input signal as a portion comprising the detected magnitude change.

The processor may be further configured to convert the time series to a standard length, for example, by buffering or warping. The processor may be further configured to normalise the time series using parameters obtained prior to implementation, for example from training data.

In an embodiment, the processor is further configured to identify a power signature of one or more electrical devices or components based on the event probability values and monitor power consumption of the one or more electrical appliances or components.

Identifying the power signature and monitoring the power consumption may comprise using a tracking system. The tracking system may be an FHMM based tracking system or an LSTM based tracking system for example. In an embodiment, the first sampling frequency is greater than or equal to 1 kHz. In an embodiment, the first sampling frequency is greater than or equal to 1 MHz.

According to an embodiment, there is provided a method of processing a power consumption signal for monitoring power consumption, comprising:

receiving a power consumption signal, the power consumption signal having a first sampling frequency;

for a first portion of the input signal, extracting a value corresponding to each of one or more characteristics of the utility consumption signal for each of a plurality of windows, each window comprising a plurality of samples, resulting in a time series of data for each of the one or more characteristics comprising the window values, the time series having a data point frequency which is lower than the first sampling frequency, wherein at least one of the characteristics corresponds to at least one frequency component of the signal;

generating an event probability value for each of one or more device or component state changes using a convolutional neural network based classifier, the event probability being the probability that the first portion of the signal corresponds to the change in state of the device or component, the classifier input being obtained from the time series;

outputting information relating to changes in state of a plurality of devices and/or components of a device corresponding to changes in the signal.

According to an embodiment, there is provided a method of training a power consumption signal processing system to monitor power consumption, the method comprising:

obtaining a training dataset comprising a plurality of portions of power consumption signals, the power consumption signals having a first sampling frequency, each portion comprising one or more labels indicating that the portion of the signal corresponds to a change in state of the device or component;

for each portion of signal, extracting a value corresponding to each of one or more characteristics of the utility consumption signal for each of a plurality of windows, each window comprising a plurality of samples, resulting in a time series of data for each of the one or more characteristics comprising the window values, the time series having a data point frequency which is lower than the first sampling frequency, wherein at least one of the characteristics corresponds to at least one frequency component of the signal;

generating an event probability value for each of one or more device or component state changes using a convolutional neural network based classifier, the event probability being the probability that the portion of the signal corresponds to the change in state of the device or component, the classifier input being obtained from the time series;

training the classifier using the labels to improve the performance.

The method may further comprise generating further training data from the training dataset by generating modified copies of the training data.

The methods are computer-implemented methods. Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal. The carrier medium may comprise a non-transitory computer readable storage medium.

According to an embodiment, there is provided a carrier medium comprising computer readable code configured to cause a computer to perform the method.

Devices that monitor the consumption of electrical power for a unit include, for example, smart meters. A unit may be a residential or commercial building for example. A smart meter can monitor and record electricity consumption for a building. Other devices, such a smart plugs, can record electricity consumption for one or more appliances connected to the plug. Such devices may comprise real-time or near real time sensors, and optionally outage notification means and means for quality monitoring. The information may be transferred, for example to a supplier, through a suitable communication medium, on request or on a predefined schedule for example. The infrastructure which may be used to facilitate collection of a power consumption signal may include the hardware comprised in the actual device, software such as data management software, communication means and customer associated systems for example. The monitored appliances may include any household or commercial electrical appliance, including but not limited to heat pumps, HVAC (heating, ventilation, and air conditioning), electric vehicles and white goods (fridge, freezer, washing machines, dryers, dishwashers).

Whilst a smart meter provides information regarding the power consumption for the entire unit, e.g. the entire building, it is often desirable to determine the power consumption for the individual appliances (or components within appliances) within the unit. Similarly, whilst a smart plug provides information regarding the power consumption for all connected appliances, it is often desirable to determine the power consumption for the individual connected appliances (or components of the connected appliance or appliances).

Electrical power consumption may be monitored by sampling voltage and current (v(t) and i(t) respectively). In an embodiment, an electrical power consumption signal comprises the current signal i(t) and optionally the voltage signal v(t). In an embodiment, an electrical power consumption signal additionally or alternatively comprises one or more of the real, apparent and reactive power signal generated from the current and voltage signals. In an embodiment, an electrical power consumption signal additionally or alternatively comprises a smoothed or averaged signal. The electrical power consumption signal may correspond to electrical power consumption for multiple devices (e.g. in a unit or connected to a smart plug) or for a single device having multiple components for example.

US9250101 , the contents of which are incorporated herein by reference, describes a method for monitoring the electrical power consumption of an electrical appliance from an aggregate power signal for a building. The method identifies the power consumption of the particular electrical appliance from the signal using a power load signature unique to that appliance. The method aims to separate the power consumption for each appliance from the aggregate signal. This is referred to as disaggregation. It is desirable to improve the accuracy of disaggregation. In the following, methods and systems for power disaggregation using convolutional neural networks (CNN) are described. In the described methods and systems, the input power consumption data stream is converted into a form particularly suited for CNNs.

In an embodiment, an input data stream sampled at a higher frequency is converted into one or more lower frequency time series streams. Each of these streams corresponds to different electrical characteristics. At least one of the characteristics corresponds to a frequency component, for example at least one of the fundamental or harmonic frequencies of the electrical signal.

The streams may capture different aspects of the current and voltage waveforms, for example the real and reactive power, harmonic amplitudes present in the signal, and the phase of fundamental and/or harmonic frequencies in the current relative to the voltage. By pre-processing the signal in this way, the CNN is able to learn complex hierarchical features corresponding to a multitude of structures present in the training data, and use these features to achieve improved classification performance during operation.

Figures 1 (a) and (b) show example data plotted from the REDD dataset, a publicly available power disaggregation dataset released by MIT. The data shows the real aggregate power input and ideal disaggregation output from a public dataset. The top panel shows the mains power level aggregate data, the remaining panels show sub metered power data for each individual device in the home. In this case there are two aggregated 'mains' channels rather than one, since the data comes from a US home, where two 115V mains lines are present. In the UK a single 230V mains line is present from which all devices draw their power.

Figure 2 shows a schematic illustration of a system for processing a power consumption signal 1 , in accordance with an embodiment of the present invention.

The system 1 comprises a processor 3 which takes an input power consumption signal and outputs disaggregation information. A computer program 5 is stored in non-volatile memory. The non-volatile memory is accessed by the processor and the stored code is retrieved and executed by the processor 3. The storage 7 stores data that is used by the program 5. The system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to an input 15 for receiving the power consumption signal. The input 15 may be a receiver for receiving data from an external storage medium or a network. Alternatively, the input 15 may comprise hardware for sampling the voltage and current consumption of the unit, for example as described in US9250101. The input 15 may include an additional current input for one or more of a solar PV installation and/or a home battery storage unit.

Connected to the output module 13 is output 17. The output 17 may be an interface that displays data to the user, for example a screen. Alternatively, the output may be a transmitter for transmitting data to an external storage medium or a network.

In use, the system 1 receives data through data input 15. The program 5, executed on processor 3, outputs disaggregation information through the output 17 in the manner which will be described with reference to the following figures.

In an embodiment, the system 1 may be a metering or plug device located at the unit. Alternatively, the system 1 may be a remote system 1 , which receives electrical power consumption data transmitted from the unit. For example, the system may be implemented on a cloud computing system, which receives electrical power consumption data transmitted from a measuring device at the unit. Although in the described system, a single processor 3 located in a device is used, the system may comprise two or more processors configured to perform different parts of the processing steps. For example, the system may comprise a processor located on a local device which performs part of the processing (for example step detection), where the output data from the local device is transmitted to a remote processor (for example forming part of a cloud computing system) which performs the remaining part of the processing (for example the CNN classification).

Figure 3(a) shows a flow chart illustrating a method of processing a power consumption signal in accordance with an embodiment.

In S1 , an input power consumption signal is received. In this case the input signal comprises an input current signal i(t) and an input voltage signal v(t). The voltage and current signals comprise the data generated by sampling the current and voltage for the building. The signal is sampled at a first sampling frequency. The sampling may be performed at high frequency. In an embodiment, sampling is performed at a first sampling frequency of 1 kHz. This results in 20 samples per mains cycle (where the mains frequency is assumed to be 50 Hz). The current and voltage signals may be input in real-time.

A single appliance or multiple appliances may be connected to the monitored utility consumption line, and the signal corresponds to power consumption for the appliance or appliances connected to the monitored line. Where a single appliance is monitored, the signal may be processed to determine the power consumption of multiple components in the appliance for example.

The utility consumption signal may comprise a single phase signal comprising a single current and voltage signal, a split phase (or dual phase) signal comprising two current signals and two voltage signals, or a three phase signal comprising three current signals and three voltage signals.

The electricity consumption data may be acquired from a collection apparatus such as described in US9250101 or any smart meter or smart plug device for example.

In an embodiment in which the input includes an additional current input for a solar PV installation and/or a home battery storage unit, the activity of the solar PV and home battery storage are removed from the input utility consumption signal at this stage. Thus the signal from any additional current input is removed from the sampled current consumption of the unit. Solar PV and home battery storage are very noisy, but can be removed by the addition of an extra current clamp to each device, allowing the signal specific to these devices to be measured, and the measured signal to be removed from the sampled current for the whole house. The utility consumption signal processed in the following steps in this case is the sampled current for the whole house, minus the current of any home battery storage power unit, minus the current of any solar PV unit.

Further processing of the input current and voltage signals may be performed at this stage. For example, a power signal may be generated, or a RMS current signal, as described in relation to Figure 7 for example. Such signals may be used for step detection for example.

In S2, for a first portion of the input signal, a value corresponding to each of one or more characteristics of the power consumption signal is extracted for each of a plurality of windows. Each window comprises a plurality of power signal samples, where a fixed number of samples is contained in each window. A time series of data for each of the one or more characteristics comprising the window values is thus generated for the first portion. The time series has a data point frequency which is lower than the first sampling frequency.

Thus each portion in the raw higher frequency input is transformed, or“down-sampled”, into one or more lower frequency time series streams or “channels”, each corresponding to a different characteristic.

The first portion of the signal may be designated or identified using a step detection method. In this case, the first portion of the signal corresponds to a change in the signal. Step detection is described in more detail in relation to Figure 7 below. For example, where multiple appliances are monitored, or a single appliance with multiple components, step detection and background removal may be performed based on step events. In one embodiment, only portions corresponding to step events are processed. Since these portions are sparsely occurring, this embodiment may be particularly suitable for operation on a mobile processor for example.

Alternatively, the signal may be processed in portions having a fixed length. For example, portions of 5 seconds duration may be processed in sliding windows, which move on by 1 second. In this embodiment, the whole data stream is processed in portions of fixed duration, for example 5 seconds.

The first portion of the signal comprises a current signal (or current channel) and voltage signal (or voltage channel), each of which is a 1 dimensional array of real numbers.

A background removal step may be performed on the first portion, as described in relation to Figure 7 for example. At least one of the characteristics extracted for the first portion corresponds to at least one frequency component of the signal, for example at least one of the fundamental or harmonic frequencies of the signal. The fundamental frequency corresponds to the mains frequency (also referred to as the utility frequency or line frequency). This is the nominal frequency of the oscillations of the alternating current signal. In the UK this is 50 Hz, whereas in the US this is 60 Hz for example.

In an embodiment, the window length corresponds to an integer multiple of the mains cycle period, for example the windows may be 1 mains cycle in duration. This ensures that each window corresponds to an integer number of mains cycles, and thus each time series for each portion corresponds to an integer number of mains cycles. Where the mains frequency is 50Hz, the time series are calculated over windows that are multiples of 20ms in duration, e.g. 20ms, 40ms or 60ms etc. In an embodiment one of the data point frequency and the mains frequency is an integer multiple of the other.

The windows may be contiguous or may overlap for example. In one embodiment, the windows are 1 mains cycle in duration (i.e. 20ms where the mains frequency is 50Hz) and have a spacing of 1 mains cycle between the start of one window and the start of the subsequent window, such that the windows are contiguous and do not overlap.

Alternatively, sliding (i.e. overlapping) windows of 1 mains cycle in duration, having a spacing of less than 1 mains cycle between the start of one window and the start of the subsequent window may be used. Thus the windows may be spaced by a fraction of a cycle, such that subsequent time windows overlap each other. For example, they may be spaced by 0.5 or 0.25 of a mains cycle. In one embodiment, the windows are spaced by 10ms or 5ms, such that they overlap with subsequent cycles by 50% or 75% respectively. In this case, the data point frequency (i.e. the sampling frequency of the time series) is 100Hz or 200Hz respectively. Thus a spacing of 10ms would result in time windows which overlapped by 50% and move along by half a mains cycle at a time, and 100 samples per second. Alternatively 5ms spacing could be used such that the subsequent 20ms windows overlap by 75%, and result in 200 channel samples per second. In practice, the mains cycle frequency may not be constant, but may continuously drift within a range of values. For example, in the UK, the grid provider is obliged to maintain the mains frequency within the range 49.5Hz to 50.5Hz. In order to compensate for this drift, a step of finding the start of the first mains cycle in the first portion by looking for a specific feature, i.e. an interest point, in the voltage waveform may be included. This feature may be the upwards zero-crossing of the voltage (corresponding to a phasor angle of zero) for example. This point is then used as the start of the first window in the first portion. The start of each subsequent mains cycle in the first portion is then identified. The windows are extracted using these identified points (for example, each window corresponding to one actual mains cycle).

In an embodiment, a step of “standardizing” or“normalising” the mains frequency for each first portion of the data is included. For example, when the first sampling frequency is 10 kHz, the number of samples per mains cycle for a 50Hz mains frequency would be 200. A particular first portion may have 201 samples in each mains cycle to the nearest integer (determined by identifying the start points of the mains cycles) for example (which would correspond to an actual mains frequency of 49.7Hz).. The first portion can be modified to have a 50Hz mains frequency by using an interpolation technique on the voltage and current data, for example, each 201 sample window can be interpolated to 200 sample points. In an embodiment, this may be implemented by reshaping the voltage and current data for the first portion using a non standard sample rate into a 2D array and applying a linear interpolation function across each row. Thus the start of each mains cycle is identified in the first portion, and the cycle duration then“normalised” to 50Hz (e.g. by interpolation).

The rate of change of frequency (RoCoF) is assumed to be low enough that over the duration of the first portion (which may be for example 5 seconds or 10 seconds) any change in the mains frequency is negligible. It is possible to approximate a single mains frequency during the first portion (e.g. 49.7 Hz). This mains frequency for the first portion may additionally be determined, since the number of samples in the event and the number of cycles in the event is known (e.g. for 100 cycles and 19800 samples, finding the mains frequency is a simple division of sampling frequency / samples per cycle = 10,000 / 198 « 50.5Hz). Extraction of values for the one or more characteristics is described in greater detail in relation to Figures 4 and 5 below. In particular, the at least one of the characteristics corresponding to at least one of the fundamental or harmonic frequency components may correspond to the magnitude of the component, or the phase angle of the current component relative to the voltage component. Characteristics corresponding to the magnitude may be referred to as fundamental or harmonic magnitude values H,. Extraction of values corresponding to fundamental or harmonic magnitudes is described in relation to Figure 4(a). Characteristics corresponding to the phase angle may be referred to as“theta values” q,. Extraction of values corresponding to theta values is described in relation to Figure 4(b). The index i refers to the frequency component, where i=1 refers to the fundamental frequency (mains frequency, e.g. 50Hz), i=2 refers to the second harmonic (2 x mains frequency, so 100Hz in the example), and so on. The harmonics are the sinusoidal components present in the frequency spectrum at frequencies that are integer multiples of the fundamental frequency. In the UK/Europe, where the fundamental frequency is 50Hz, this will mean 100Hz, 150Hz, 200Hz... up until the Nyquist frequency, which is one half of the sampling frequency of the monitoring hardware. Where the data is“normalised” for any drift in the mains frequency as described above, the“nominal” mains frequency (i.e. 50Hz) corresponds to the fundamental frequency for each window and each first portion.

As well as one or more characteristics corresponding to harmonic magnitude and/or theta values, a time series corresponding to one or more characteristics such as real power (P), reactive power (Q) and apparent power (S) may also be extracted for example.

For a split phase (or dual phase) signal comprising two current signals and two voltage signals, two time series may be generated for each characteristic, corresponding to each phase. Similarly, for a three phase signal comprising three current signals and three voltage signals, three time series may be generated for each characteristic.

Additionally, a time series corresponding to water flow or gas flow may be extracted from water consumption data and gas consumption data respectively. Although in the described method, the same window size has been used for each “channel” (i.e. real power, harmonic magnitude, etc), in an embodiment, different window sizes may be used for different channels.

One or more lower frequency time series of data, at least one of which corresponds to at least one of the fundamental or harmonic frequencies, are extracted from the raw data. A 1 dimensional time series of data is extracted for each characteristic for the first portion of the signal. One dimensional time series of real power values P, reactive power values Q, apparent power values S, harmonic magnitude values H and theta values Q capture different aspects of the current and voltage waveforms and electrical characteristics of the appliance drawing power.

Step S3 comprises generating a classifier input from the one or more time series. In an embodiment, the one or more one dimensional time series may themselves be the classifier input, and are used directly. However, further processing may be performed on the one or more time series in this step, for example one or more of buffering, stretching and normalisation of the time series may be performed to generate the classifier input, as described in relation to Figure 7 below for example.

Step S4 comprises generating an event probability value or values using a CNN based classifier. The classifier generates an event probability value for each of one or more appliance or appliance component state changes. The event probability value is the probability that the first portion of the signal corresponds to the change in state of the appliance or component.

The classifier input is inputted into the classifier in order to generate the values. The classifier input is obtained from the time series, for example the time series themselves may be directly input, or the processed (e.g. buffered/stretched, normalised) time series generated in S3 may be input. In this method, the inputs to the CNN are one or more 1 dimensional time series, each corresponding to a different characteristic (for example real power, apparent power, reactive power, one or more harmonic magnitudes and/or one or more theta values). Each time series corresponds to a different channel of data extracted from the original data, analogous to red, blue and green channels used for image recognition based applications for example. The number of time series or channels corresponds to the“depth” of the input data. Using a CNN provides improved performance for classification, in particular since the input relates to signal processing and local structure (i.e. correlations between local data points).

CNNs are a class of Neural Network (NN) that include one or more convolutional layers. A convolutional layer comprises a set of linear filters, each characterised by a weights vector, and an additive bias term per filter. During operation, which corresponds to forward propagation, each filter is convolved across the input data. Each filter is a 2D array, comprising a 1 D patch or template (which during operation is convolved across the input data) for each channel. Thus for a patch length of 3 and 4 channels, each filter is a (4 x 3) array, which is convolved across the 4 channel input data. Each convolutional layer may have a different number of filters. Convolution of the filter across the input data results in 1 -dimensional output data corresponding to each filter. The“stride” defines how much the filter is shifted along the input data in each step of the convolution, and is defined for each layer. The convolution operation in the convolutional layers (and the neurons in the fully connected layers) are followed by a differentiable non-linear activation function. In an embodiment, the activation function may be a 0-threshold ReLU (Rectified Linear Unit) function, a sigmoid function or a tanh function for example. The 1 D data output from the filter is inputted to the activation function.

The output of each layer is then fed as the input to the subsequent layer, which may be a further convolutional layer or a different type layer for example. In particular, one or more pooling layers may be included between one or more convolutional layers, which essentially down-sample the data.

The weights and biases which characterise each layer are learned before operation during the training stage, which will be described below in relation to Figure 11. These are the trainable parameters. Some of the layers, such as the max-pooling layers, may not have any trainable parameters.

There is a large variety in CNN architectures; however all comprise stacked layers, which learn hierarchical features useful for classifying an input. Each subsequent layer uses the features output by the previous layer as its input. The top level features of a CNN correspond to the most abstract, complex features that the model learns from its training data. When the trained CNN is deployed, it reacts to similar complex structures in the inputs to those it learned during training. The most general architectures for CNNs are standard feed-forward convolutional networks whose lower layers are convolutional layers which may be followed by traditional fully connected layers. A number of other component layer types may be used, including but not limited to max pooling layers and batch or weight normalisation layers.

The final layers of the CNN comprise one or more classifier layers. These may be referred to as“dense” layers. A flattening layer may be included to convert the output of the final convolutional (or pooling) layer into a 1 -dimensional feature vector to be input into the classifier layer. The outputs of the C output neurons in the final convolutional (or pooling) layer are thus flattened and then converted into a categorical probability distribution by a classifier, for example a SVM or softmax based classifier. The classifier layer converts the vector of real numbers output from the flattening layer into a vector of probabilities, which is a legitimate probability distribution (i.e. sums to 1). The probabilities correspond to the likelihoods that the CNN input is a member of a particular class (i.e. corresponds to a particular appliance or device state change). The number of outputs of the classifier layer (i.e. the number of probabilities) is defined during initialisation, and corresponds to the possible classes (i.e. the appliance or device state changes represented in the training data).

The final fully connected layer has C output neurons, where C is the number of unique classes in the data (i.e. the possible appliance and component state changes, which are defined during the initialisation step during the training process). The activations of the C output neurons are converted into a categorical probability distribution by the classification function, for example a linear SVM or a softmax classifier. Each classifier choice has a corresponding standard loss function (hinge loss for the linear SVM and cross-entropy loss for the softmax classifier) which is used during the training stage. In some embodiments, where true labels are available during operation (for example where users input information about which appliance has changed state) then the loss can also be obtained during operation, and the CNN trained during operation.

Figures 3(b) and 3(c) show example CNN models which may be used in the method of Figure 3(a). Figure 3(b) shows an 8 layer network, comprising two convolutional layers, followed by a max pooling layer, two convolutional layers followed by a max pooling layer, two convolutional layers followed by a max pooling layer, then finally a flattening layer, a“dense” layer, a dropout layer and a“dense” layer. The input data comprises 4 channels or time series (P, Q, H, theta), each comprising 250 data points (corresponding to the“width”). The“height” of the input data is zero, since the time series are 1 -dimensional. There are 32 filters in the first layer, and 248 possible positions for the filters along the input (using a stride of 1 , and no zero padding), thus the output of the first layer is 248 x 32. The kernel dimension is kept at 3 pixels width. The activation function used is ReLU. In this case, the CNN comprises two classifier (or “dense”) layers, with the dropout layer between as the final layers. The dropout layer essentially eliminates outputs from certain neurons (i.e. filters) from the network. The final dense layer has 10 outputs, each corresponding to a class (i.e. a state change of an appliance or component). The receptive field of a filter corresponds to the number of inputs in the input data that it responds to. Each filter in this case corresponds to a width of 3 pixels activations in the layer immediately below, thus the receptive field of the first layer is 3, the second layer is 5 and so on. Probability during training for dropout layer is 0.5. Figure 3(c) shows an alternative model, having 11 layers.

The list of appliance and/or component state changes, which correspond to classes and sub-classes in the classifier, and the corresponding event probability values outputted from the CNN may be the final output of the system, or may be inputted into a probabilistic model for further processing, as described in relation to Figure 14 below. Alternatively, the most probable device state change corresponding to the first portion of the signal as determined by the CNN is simply outputted from the system as the final output.

Figure 4(a) shows an example method of extracting one or more time series corresponding to fundamental and/or harmonic magnitude values, which may be used in the method of Figure 3(a).

In step S1 , the sampled values of the current and voltage signals for the window are each transformed into the frequency domain. This step may involve applying a Fourier transform, or any other operation suitable to transform the sampled values into the frequency domain. In an embodiment, the current signal and voltage signal (each of which is a 1 dimensional array of real numbers) for the first portion are each transformed into a 2 dimensional array with dimensions (m_windows, n_sampiesperwindow) · A transform into the frequency domain (e.g. an FFT or FFT variant such as real FFT (RFFT)) is then applied to each row of each 2 dimensional array. This results in an array of complex numbers with dimension (m_windows, ^WsamplesP^erwmdow) for each of the current and voltage.

In step S2, magnitude values I for one or more of the fundamental and harmonic frequency components for each window are determined. The time series value for the window for the characteristic is then determined from this value. The magnitude value for a particular frequency component for a particular window corresponds to the modulus of the complex number in the array entry corresponding to the window (row) and frequency component (column) in the array corresponding to the current.

For each window, a value of the magnitude of the fundamental frequency component is taken, resulting in a time series of data corresponding to the magnitude of the fundamental frequency component. Additionally or alternatively, for each window, a value of the magnitude of the second harmonic frequency component is taken, resulting in a time series of data corresponding to the magnitude of the second harmonic frequency component. Additionally or alternatively, for each window, a value of the magnitude of one or more of the other harmonic frequency components is taken, resulting in one or more time series of data corresponding to the magnitude of the one or more other harmonic frequency components. Additionally or alternatively, for each window, the magnitude of the DC component (zero frequency) is taken.

Additionally or alternatively, there may be a“harmonic channel” in which for each window, the RMS of all harmonic components present in the current is taken. This may be expressed by the formula:

where I, is the magnitude of the frequency component corresponding to the ith harmonic for the window, where i = nyquist corresponds to the Nyquist frequency. This results in a single time series of data. This is also the numerator of the equation for Total Harmonic Distortion. In an embodiment, one or more of the channels corresponds to a parameter related to total harmonic distortion. The total harmonic distortion corresponds to the component of the reactive power that comes from the harmonics. This may be indicative of the appliance type.

Additionally or alternatively, there may be one or more “harmonic channels” each corresponding to the RMS of a set of harmonics, for example a continuous set of harmonics.

In one embodiment, the lower harmonics are of more interest, and an individual channel corresponding to each of the first X harmonics is used, together with a number of “collected” channels for higher order harmonics. In an embodiment, X=6. In an embodiment, the harmonic channels are:

In an embodiment, the sampling frequency is greater than or equal to 1 MHz. In an embodiment there are included one or more harmonic channels corresponding to H_i:j where i is a harmonic component in the region of 10kHz and j is a harmonic component in the region of 30kHz. In an embodiment there are included one or more harmonic channels corresponding to H_{i :}j where i is a harmonic component in the region of 30kHz and j is a harmonic component in the region of 100kHz. In an embodiment there are included one or more harmonic channels corresponding to H_i:j where i is a harmonic component in the region of 100kHz and j is a harmonic component in the region of 300kHz. In an embodiment there are included one or more harmonic channels corresponding to H_{i :}j where i is a harmonic component in the region of 300kHz and j is a harmonic component in the region of 1 MHz. In an embodiment there are included one or more harmonic channels corresponding to H_{i :}j where i is a harmonic component in the region of 1MHz. In an embodiment, there are included one or more harmonic channels which correspond to broad high frequency bands. Additionally or alternatively, there may be a“harmonic channel” corresponding to two or more of the even harmonics. In an embodiment, there is a harmonic channel corresponding to the RMS of all the even harmonic components. Even harmonics behave differently and are only present in currents which have asymmetric IVs (i.e. for which the current signal is not symmetric about the time axis, when any DC component is removed, for example appliances which contain a diode component). For example, all even harmonics may be grouped into a single channel, whilst having individual channels for lower order odd harmonics:

The Nyquist frequency is typically an even harmonic, and the equation follows this assumption. Such time series may be effective since the role of even harmonics is minor in comparison to the odd harmonics.

Figure 4(b) shows an example method of extracting Q values.

In step S1 , the sampled values of the current signal for the window are transformed into the frequency domain, as in Figure 4(a). Where both harmonic magnitude and theta values are used, this step need only be performed once, and the magnitude and theta values can be extracted from the same spectral data.

In step S2, theta values Q for one or more of the fundamental and harmonic frequency components are determined. The time series value for a characteristic is then determined from these values. The“theta channel(s)” record the phase angle of major frequency component(s) present in current with respect to the voltage.

In one embodiment, for each window, a value 0_! of the phase angle of the current fundamental frequency component relative to the voltage fundamental frequency component is taken, resulting in a time series of data corresponding to the 0_! values. The 0_! channel records the phase angle of the current’s fundamental frequency relative to the voltage’s fundamental frequency, i.e.:

Z.l

where ^* is the argument of the complex number corresponding to the fundamental

.···- V\

frequency for the window for the current array, and ^{' s} is the argument of the complex number corresponding to the fundamental frequency for the window for the voltage array.

Additionally or alternatively, a set of one or more theta channels corresponding to harmonic frequency components may be used. The harmonic theta trajectories also record the phase angle of the current harmonic relative to the phase of the voltage fundamental frequency component.

where k is the desired harmonic.

For example, the fundamental plus the 3rd, 5th and 7^th theta values may be used.

Figure 5 shows an example method of extracting a time series corresponding to harmonic magnitude values and/or a time series corresponding to theta values.

In S501 , the first portion of the signal, comprising a current signal and voltage signal (each of which is a 1 dimensional array of real numbers) is each transformed into a 2 dimensional array with dimensions (rri_windows, n_sampiespe,wi_ndow)·

In S502, a transform into the frequency domain (e.g. an FFT or FFT variant such as RFFT) is then applied to each row of each 2 dimensional array. This results in an array of complex numbers with dimension (rriwindows, ^{Ws amples}P^erwmdow) for each of the current and voltage. Since the input data is real values, the output of the frequency transform is symmetric, and only one side need be kept.

In S503, calculation of the harmonic and/or fundamental magnitude channel or channels is performed. For each harmonic/fundamental frequency channel H,, for each row of the array corresponding to the current, the relevant magnitude value is extracted.

For example, where the channel corresponds to the value of the magnitude of a particular frequency component, the magnitudes for the column corresponding to the component for each row are extracted to form the time series. The magnitude for a particular frequency component for a particular window corresponds to the modulus of the complex number in the array entry corresponding to the window (row) and frequency component (column) in the array corresponding to the current.

Where the channel corresponds to the RMS of two or more frequency components, for each row, the RMS of the magnitudes for the respective current columns corresponding to the components is calculated. For example, for H_2:Nyquist:2 (i.e. the RMS of all the even harmonics), for each row, the RMS of columns (3, 5, ... ) up to the column corresponding to the Nyquist frequency is calculated. The RMS values for each row form the time series.

The H, time series for the first portion then comprises the values H_H from 1=1 to l=L, L being the total number of windows in the first portion.

In S504, which may be performed as well as or instead of S503, the theta channel or channels calculation is performed. The theta channel calculation is performed by first finding the argument of each of the complex numbers corresponding to the fundamental frequency (second column) in the voltage array. Thus for each row, the argument of the complex value in the fundamental frequency column of the voltage

%/_<

array is determined. This results in the series of values ⁵ . Next, the argument of the complex numbers in the column corresponding to the desired frequency component k (i.e. the k+1th column) for the current array are calculated. This resu _ll_tts . in the va ,lues Z/ ^' *, .

Finally, the theta channel values are calculated as:

The 0_k time series for the first portion then comprises the values 0_W from 1=1 to l=L, L being the total number of windows in the first portion.

One or both of steps S503 and S504 may be performed. In addition, step S505 may also be performed, in which one or more of the real, apparent and reactive power values are determined. Thus as well as at least one characteristic corresponding to at least one of the fundamental frequency or harmonic frequencies of the signal, one or more of the real power, reactive power and apparent power values may also be used.

In an embodiment, a real power value P_| for a window I is calculated from:

where N is the number of samples in each window I and v_n(t) and i_n(t) are the current and voltage values for the sample n of the voltage and current signals, with or without background removed. The real power time series for the first portion then comprises the values R from 1=1 to l=L, L being the total number of windows in the first portion. T is the duration of the window I. For a 10kHz sample rate in a country with 50Hz main power, where the window corresponds to 1 mains cycle, N = 200.

In an embodiment, an apparent power value |S|i for a window I is calculated from:

The apparent power time series for the first portion then comprises the values |S|i from 1=1 to l=L, L being the total number of windows in the first portion.

In an embodiment, a reactive power value Qi for a window I is calculated from:

The reactive power time series for the first portion then comprises the values Qi from 1=1 to l=L, L being the total number of windows in the first portion.

In an embodiment, a time series corresponding to P and a time series corresponding to Q are present.

Figure 6 shows an example of the time series for a first portion of the signal corresponding to an onset event of a bag-less vacuum cleaner. The figure shows real power P, reactive power Q, a channel corresponding to the RMS of all harmonic components up to the Nyquist frequency (Harmonics) and the 0_! channel. Each window corresponds to one mains cycle. The first portion of the signal corresponds to 140 windows.

The four time series may be inputted directly into a CNN based classifier, or further processing steps may be performed, as will be described below.

Figure 7 shows a flow chart illustrating a method of processing a power consumption signal in accordance with an embodiment. In this method, step detection is performed. The first portions of the signal, which are then analysed, are identified by the step detection. A background removal step is also performed on each portion of the signal to be analysed, before the time series corresponding to the characteristic(s) are calculated.

The time series corresponding to each characteristic is then buffered. Alternatively it may be stretched. Such steps are particularly suitable where step detection with varying potion length is used. Step detection may result in portions of differing length being analysed. A step of buffering or stretching of the signal may be included to ensure uniform input length to the classifier. A normalisation step is also included to generate the classifier input from the time series.

In S1a, the raw current i(t) and voltage v(t) signals are input. This step corresponds to S1 in Figure 3.

In step S1b, a total power consumption metric is obtained, for example real power, apparent power (the calculation of which has been described above) or RMS current. A real power (or RMS current or apparent power for example) value is obtained for windows corresponding to a plurality of samples. The windows may be the same length and spacing as those used for the subsequent generation of the time series. Alternatively, windows of different lengths or spacing can be used.

In step S2a, a step detection algorithm is performed. In this method, the step detection algorithm is performed using the total power consumption metric (e.g. power or RMS current). For non-intrusive load monitoring of a multitude of appliances, this step allows detection and extraction discrete ‘step events’. These events may be extracted in portions of fixed or varying duration.

An example step detection process will now be described, however any alternative method for extracting portions of the signal corresponding to events can be used. In the described method, changes in real power level that exceed a threshold and are preceded and followed by quiet periods of low variance are detected. These detected steps are output in windows of varying duration, which may be in the range from 0.6 seconds up to maximum of 5.0 seconds, dependant on the time taken for the signal to settle into a low variance regime.

The real power signal is first monitored for a change in the magnitude. Alternatively, apparent power or RMS current values may be used (or two or more of the total power metrics may be monitored). In an embodiment, a change in magnitude is detected if a change greater than a threshold value is detected. In an embodiment, the threshold is 10 Watts. In an embodiment, a change is detected by calculating the difference between adjacent power samples in the real power time series signal, referred to as the deltapower. When the deltapower value exceeds the threshold value, an event is detected. Alternatively, an event may be detected only if the sum of the deltapower values within a specified time window exceeds the threshold value for example. In an embodiment, the time window is 600ms. In an embodiment, where the utility consumption signal comprises a split or three phase electrical signal, monitoring for a change in the magnitude of the input signal may comprise generating one or more of the corresponding real, apparent or RMS current signals from each pair of current and voltage signals, and monitoring each. A change detected in any of the signals is registered as a change.

When a change is detected, a first portion of the power time series signal comprising the detected change in magnitude is extracted and stored as will be described in the following. As well as the power time series, the portion of the current signal and the portion of the voltage signal corresponding to the same time period are also extracted and stored as part of the first portion. The portions may be stored on a local device, or on remote storage, for example a cloud computing system.

Extracting the first portion comprises the following steps. A portion of the signal before the detected magnitude change is first extracted and stored as part of the first portion. The duration of this portion is referred to as the left-settle period t_setiieieit- The portion of the signal is referred to as the left settle portion. The left settle period is an amount of time leading up to the detected change. In an embodiment, the left-settle period is of the order of 50 cycles (1 second). The left settle period may be smaller in some cases however, for example if another event had been extracted from the live data within 1 second before the currently observed transition or if the energy disaggregation system itself has just switched on.

In an embodiment, in order to extract the left settle period, the time location of the beginning of the detected magnitude change in the power signal is determined. This can be determined by calculating a rolling variance for the signal, generating a left settle period threshold value from these variance values, and identifying the location of the beginning of the detected change in the power signal as the location at which the signal crosses the left settle period threshold value. The left settle period is then extracted as the period from a fixed time before this location, up to the detected time location of the beginning of the detected change. In an embodiment, the left settle period threshold is 0_{|eft-settie-th}r_esh = max(0_|eft-min, awning), where Oi_em-min is a constant, and in an embodiment is a constant in the region of 50W real power. O_|r0iiing is the rolling variance.

Figure 8 shows an example of an extracted first portion, which comprises a captured event caused by a clothes dryer switching on. The dashed vertical lines in the plot show the boundaries between the left settle period _tieieft described above, a transition period transition and a right settle period t_settieright-

After the transition has been detected, and the left settle portion of the signal extracted and stored, a point at which the deltapower values are below another threshold value for a set number of cycles is looked for. This duration is referred to as the“right settle period” and the corresponding portion of the signal the right settle portion. The threshold value and number of cycles may be constant.

It is monitored for such a right settle portion for a time t_evmax after the beginning of the first portion. In other words, the duration of the first portion is upper bounded at t_evmax· In an embodiment, t_evmax is 5 seconds. In another embodiment, it is 10 seconds. The inputted signal is extracted and stored until a right settle period is detected. Detection of a right settle period implies a steady state has been observed in the live real power data.

Each first portion can thus be divided into three separate phases: the left settle period tsettieieft, a transition period transition and a right settle period t_settieright- The transition portion tr_ansition is the part of the first portion between the left and right settle portions, in which the transition is considered to occur. It is the part of the signal from the detected magnitude change up to when the settle criteria, which is used to detect the right settle portion, has been found. Thus tevent ^— tsettieieft transition tsetieright-

In a further embodiment, monitoring for the right settle period may comprise two stages, a first stage in which it is monitored for a right settle period according to a first settle criteria and a second stage in which it is monitored using a second settle criteria. The second stage is performed if the first stage failed to find a right settle period within the upper bound for the duration of the first portion, i.e. within t_evmax· In this embodiment, it is first monitored for whether a right settle period occurs within t_evmax after the beginning of the first portion according to the first settle criteria. This comprises: a) Setting a first standard deviation threshold o_strict = max(o_right-min, o_{right-reiative}), where o_right-min is a constant, and in an embodiment is a constant in the region of 50W real power. o_{right-reiative} = A_{|rO|eft-settie} when the event was an onset, i.e. a device switching into an active state, i.e. where the detected change was an increase. o_{right-reiative} ⁼(1^i_r)^ai_eft-settie when the event was on offset, i.e. a device switching into an off state, i.e. where the detected change was a decrease. l_|G is a constant scaling factor and in an embodiment is of the order of 2, and 0_|eft-Settie is the standard deviation of the real power values in the left settle period. b) It is then determined whether the real power values during an initial period of ri_max-cyc- right power cycles after the detected transition has a standard deviation equal to or less than o_strict. If so, this period is designated as the right settle period. If not, it is determined whether each subsequent rolling period of the same duration has a standard deviation equal to or less than o_strict· In an embodiment, n_max-cyc-hght is in the region of 25 cycles.

For each subsequent rolling period, the n_max-cyc-hght cycle window is moved along e.g. 1 cycle and checked again. This is continued until either: i) The n_{max-cyc-right} window has a standard deviation less than or equal to o_strict - in this case the right settle period is found; or

ii) The end of the n_max-Cyc-right window reaches t_evmax·

In other words, if o_S|iding £ o_strict then the right settle period is found, where o_S|iding is a vector of “sliding window” standard deviation values, and in an embodiment is calculated by taking the standard deviations of the power signal with a 200ms sliding window moving along the first portion in steps of 100ms or less.

If a right settle period is found, then the end of the first portion is set as the end of the right settle period. In this case, t_event < t_evmax· If a settle period is not found according to the first criteria, it is searched for using the second criteria, which comprises: a) Setting a second standard deviation threshold o_reiaxed = A_reiaxedmin(o_S|iding), where A_reiaXed is a constant, and in an embodiment is of the order 1.25, and O_siiding is a vector of “sliding window” standard deviation values, and in an embodiment is calculated by taking the standard deviations of the power signal with a 200ms sliding window moving along the first portion in steps of 100ms or less. b) If the real power values during the final n_max-cyc-right cycles of the portion of the signal up to t_evmax have a standard deviation less than or equal to o_reia_Xed , then that period is set as the right settle period. In an embodiment, n_max-cyc-right is in the region of 25 cycles.

If a right settle period is found, then the end of the first portion is set as the end of the right settle period. In this case, t_event = tevmax-

If not, it is searched for the longest period within the portion of the signal up to t_evmax with standard deviation less than or equal to o_reiaXed and this is used as the right settle period. The end of the first portion is set as the end of the right settle period.

The part of the data stream after the right settle period is excluded, and the first portion has been extracted and stored, corresponding to the portion from the start of the left settle period to the end of the right settle period. The portion after the right settle period is then processed as a separate event, i.e. it is searched for a change in magnitude.

The horizontal dashed lines in Figure 8 show the trigger boundaries, or“settle criteria” for the left settle period i.e. corresponding to 0_{|eft-Settie-thresh} and right settle period.

In an embodiment, the values of some or all of the thresholds and periods used in the method can be determined by an optimiser that re-runs the disaggregation method, and evaluates disaggregation performance over a range of values for the constants. In step S2b, a background removal step is performed. An example method of background removal will be described, however any method of removing a background signal may be used.

In the example, the background is determined by generating a“mean off cycle”, by generating a mean current cycle from whichever of the left settle period and the right settle period corresponds to the off state. If the detected change is an increase then the left settle period will be the off state and if it is a decrease then the right settle period is the off state. Generating a mean current cycle involves calculating the mean current value for each discrete time location in the current cycle, resulting in a mean current vector which is representative of the background. Removing the background then involves subtracting this mean current vector from the current vector of each cycle of the first portion. This equates to removing the signature of any background devices, assuming that that they are steady state.

In S2c, the remaining time series are calculated. The time series may also be referred to as “trajectories” or “signatures”. One or more time series have already been determined in step S1b in this case (e.g. the real power time series). The remaining desired time series are then calculated in S2c. In this step, for a first portion of the input signal extracted in S2a and b, a value corresponding to each of one or more characteristics of the utility consumption signal is extracted for each of a plurality of windows. Each window comprises a plurality of signal samples. A time series of data for each of the one or more characteristics comprising the window values is thus generated for the first portion. The time series has a data point frequency which is lower than the first sampling frequency. At least one of the characteristics corresponds to at least one of the fundamental or harmonic frequencies of the signal. Where real power and/or apparent power values have already be generated in step S1b, these values may also be included as time real power and apparent power time series. Extraction of the time series values has been described in detail above.

In step S3a, the time series are buffered. This step is optional, and may be performed where the duration of the first portion is not fixed. In this case, capture of events from many different types of device in a household, some of which transition in ms, while others require as many as 10 seconds or more in order to finish transitioning, is possible. CNNs are well suited to multi-channel input, however they may have improved performance where the dimensionality of the input data is constant across the training examples and the inputs during operation. Where the first portions may be of varying duration, a step of standardizing the size of the input data may therefore be included.

In one embodiment, a constant input size, for example 5 seconds, is selected. Any shorter events are then buffered to this size. Any longer events are shortened to this size, or discarded.

For each portion shorter (e.g. of duration 1.8 sec) than the chosen input size (e.g. 5 sec), buffering may be performed as follows:

Create an empty data buffer of the chosen input size (e.g. 5 sec) - this may be an array of dimension (n_channels, n_samples_sec x duration), for example (4, 250) in an example with 4 channels and 250 samples in the chosen window size.

Place the unbuffered event in a given starting position within the buffer, for example flush with the left hand side, or exactly in the centre - in the example where the unbuffered event is 1.8 sec, the sampling rate 50Hz, and the standard input duration 5s, the unbuffered portion duration would be 90 samples, and there would be 160 samples to buffer (where the starting placement is on the left hand side of the buffer, 160 samples must be buffered to the right, where the starting position is in the centre of the buffer, 80 samples would need to be buffered to the left, and 80 to the right).

Buffering of empty values on the left hand side (if any) is performed by repeating the first value of the unbuffered event, meanwhile the empty buffer values on the right hand side (if any) are buffered by repeating the final value of the unbuffered event.

Alternative implementations may add random noise to the repeated values. This will be described in more detail below in relation to data augmentation.

Alternatively, the values may be stretched or compressed to the chosen portion length for example. Figure 9 shows a buffered time series. The time series correspond to those in Figure 6, after buffering has been applied. In this example the time series were buffered by repeating the first and final unbuffered values, as described above.

In an embodiment, a step of normalising the time series may be applied, in step S3b. This step may be applied after buffering for example.

The normalisation may be“per channel” normalisation or“per sample” normalisation for example.

A step of normalising the input time series may be included when using a large CNN model, for example a model with 5 or more layers. For smaller CNN models, a normalisation step may be omitted. For training large CNN models, a step of normalising the datasets such that the whole collection of training examples (not necessarily individual examples) meet requirements set by certain statistical properties may be applied. Such normalisation (using the statistical requirements generated during training) is then also applied during operation.

In one embodiment,“per sample” (also referred to as“per pixel”) normalization is used. In this approach, the mean and variance of each“feature” or sample in the dataset are respectively normalised, such that across the training dataset the mean value of the feature is zero and the variance of all values of that feature is 1.

Thus during training, for example, H₅₂ (the 52nd sample of the RMS harmonic time series) may be normalised across the training data set by subtracting the mean value of H₅₂ from all events in the training dataset, and then dividing by the variance of all events in the training dataset. The mean normalisation (i.e. using the mean and the variance used during training) applied to each feature is then stored to be reapplied during operation.

During operation, the same procedure is applied to each time series, using the mean and variance values generated during training. Thus the stored mean and variance is used to normalise new sections of interest during deployment. In an embodiment, the mean and variance values may be updated during operation. In an alternative embodiment,“per channel” (or“per image”) normalisation is used. In this method, each time series for each event in the training data set is mean-variance normalised, such that the mean and variance of the time series across each individual event are respectively 0 and 1. For certain time series (e.g. real power), the mean- variance may be performed across a population at training time, such that the mean and variance over the population are respectively 0 and 1 , but vary for individual data. For these time series, the mean and the variance for each channel from the training data is recorded, and used in the normalization during operation.

Figure 10 shows the time series of Figure 9 once normalisation has been applied.

Steps S3a to S3c are example steps performed to generate a classifier input from the time series. The input is then inputted into the CNN in S4.

Before the classifier is deployed for operation, it is trained using a training data set. The CNN may be optimised during training against a misclassification loss function such as softmax, with weights adjusted by gradient descent. After training is complete, the activations of the final fully connected layer neurons form the new representation of the electrical system or deltas.

Figure 11 shows an example method of training a CNN, illustrated using a training flowchart. The training is performed using a training data set. The training data set may comprise a large number of current and voltage signals, corresponding to many different electrical appliances and components for example. The training data set is labelled, with information identifying particular sections of the current and voltage signals as corresponding to appliance or component state changes (for example, a portion may be labelled as“washing machine pump switch on”). The label may be inputted as a "one-hot" vector of all zeros with a single“1” in the corresponding position for the true label (i.e. true state change). For example, the training data set may comprise a large number of portions, each portion comprising a current and voltage signal corresponding to a state change of an appliance or component, and a ground truth label indicating the appliance or component state change. The training data is pre-processed in the same manner as described in relation to steps S1 to S3 in Figure 3 (or for example in steps S1a to S3b in Figure 7).

As explained previously, training data may comprise extracted steps which are of varying duration, so a method of standardising the size of the input data may be used. In an embodiment, a constant input size, for example 5 seconds is selected, and buffering of any shorter events to this size is performed. The same buffering (to the same input size) in then used during operation. The maximum event duration is likely to be of the order of 5 seconds, thus this may be selected as the standard input width. The number or time series n_Channeis is the input height.

In the flow chart, steps S1 to S4 correspond to initialisation steps. The learning process is performed in steps S5 onwards, iteratively in batches of training data.

In S1 , the labelled training dataset (and optionally a labelled validation data set) is obtained. A defined CNN architecture is used and initialised. The CNN architecture defines the number and type of layers and the hyper-parameters for example. The number of inputs (i.e. number of channels) and number of outputs (i.e. number of possible classes in the training data) are also defined. An example CNN architecture is shown in Figures 3(b) and (c) for example. The CNN architecture comprises a number of hyper-parameters. The hyper-parameters may include one or more of the type of layers in the network (e.g. presence of dropout layers, the presence of maxpooling layers), the number of layers in the network (e.g. number of each type, e.g. number of convolutional layers), the sequence of layers (e.g. where to include max pooling or normalisation layers), the non-linear activation function used in the layers (ReLUs for example), the dropout probability for any dropout layers during training (e.g. probability 0.5), whether to use zero padding or not, the number of filters in the convolutional layers, the kernel dimensions of the filters in each layer (e.g. 3 or 5 pixels width), and the strides of the kernels (e.g. 1) for example. One or more of these may be optimised during training using the validation data set.

The loss function is selected and defined in S2. Loss functions that may be used include a cross entropy loss function or hinge loss function for example. The loss function is a measure of the compatibility between the prediction (i.e. the event probability values generated by the CNN) and the ground truth label in the training example. For each training data point, the loss function is a function of the weights and bias variables, the ground truth label and the output.

In S3, the optimiser is selected and defined. The optimiser may be a gradient descent based optimiser, for example a stochastic gradient descent optimiser (SGC), a Momentum optimiser or an ADAM optimiser. The optimiser hyper-parameters are also defined, for example the learning rate, and if applicable momentum and acceleration parameters, learning rate decay policy etc. These may also be optimised during training using the validation data set.

The weight and bias variables are initialised in S4 for all of the layers. Standard initialisation techniques include the“Glorot” or“Xavier” technique used to initialise the weights. The bias values may be initialised to zero or small positive numbers for example.

In S5, the learning process is commenced. Steps S5 to S11 are performed iteratively, using batches of the training data, until convergence of the validation error is reached, or sufficient time is deemed to have passed. Absolute convergence of the validation error may not occur, in which case continued training steps with a decaying learning rate may cause the validation error to improve arbitrarily slowly. The model may therefore be judged to be sufficiently trained after a number of hours or days of training on a GPU for example.

In S5, the batch of training examples for the iteration is selected, for example whole batch or mini-batch. For example, the training may be performed on a graphic processing unit (GPU) for example, in which case the batch size is selected dependent on the memory of the GPU.

In S6, the model is run using the weights and biases and inputting the batch of training examples. This results in the event probability values for each class and sub-class, corresponding to each appliance or component state change, outputted at the final layer for each training example. In this step, forward propagation, or a forward pass is performed with each training example in the batch. The loss function is used to determine a loss value for each training example, calculated using the output of the final layer and the label for the training example. The loss gives a measure of the compatibility between the prediction (i.e. the event probability values generated by the CNN) and the ground truth label for the training example. The result is a vector of loss values, one loss value corresponding to each training example. Thus in S7, the class probability vector and loss for each training example is output. The loss may be used as a performance metric for example (e.g. when the loss converges, the training may be stopped or the learning rate decreased).

In S8 to S10, a back propagation algorithm is performed. The steps are performed for each layer in the model, starting at the final layer and working backwards, to update the weights and bias variables.

S8 comprises, for each training example in the batch, calculating the gradient of the loss with respect to each of the parameters (i.e. weights and biases) in the neural network, based on the back-propagated error signal and the feedforward activation signals (inputs to each layer). Every operation performed in the forward pass (i.e. the dot-products of the filters/fully connected-layer weights with the layer inputs, the non linear activation functions, and the loss/classifier functions) is differentiable, therefore a functional expression for the derivative of the loss with respect to every single weight/bias can be determined by virtue of the chain rule. The gradient values are calculated from these expressions using the back-propagated error and the activations (inputs) for each layer from the forward pass (cached during the forward pass).

This results in an array of gradient values, each corresponding to a weight/bias, corresponding to each training example. These are converted to a single gradient value for each weight/bias (for example by taking the average of the gradient values for all training examples for the particular weight/bias).

In S9, the gradient for each weight/bias is used to calculate the updated layer weight and bias variables from the previous values using the optimizer function (i.e. a gradient descent type optimiser). The input to the optimiser function for each weight/bias value is the previous weight/bias value, the gradient corresponding to the weight/bias value and a learning rate parameter. In general, gradient descent based optimizers update the weight in the direction of steepest descent of the loss function (w.r.t. the weight) scaled by a learning rate parameter h. In S10, the update is applied to the layer, i.e. the weight and bias variables are replaced with the new values. The process then returns to S5 and iterates with another batch of examples.

Optionally, an evaluation step S11 is included, which is performed using the validation data set. This may be performed once every 2000 steps for example. In this step, a validation error is determined. If the validation error converges then the model may be deemed sufficiently trained. Alternatively, this step may be omitted and the training simply performed for a set period of time. For example, validation error may be determined by looking at the percentage of examples where the maximum probability class is the same as the label. The validation data set can be used to optimise the hyper-parameters. A range of hyper-parameter combinations can be tried for an appropriate number of epochs, and those resulting in the best validation score can be chosen for the final model for example. In an embodiment, a basic Stochastic Gradient Descent optimiser can be used, together with tuning of the learning rate using the validation data for example. An "epoch" corresponds to all the training data passed through once. This is done by sampling without replacement. For example, an epoch of 10,000 images could be trained on in 10 mini-batches of 1 ,024 training data points.

The final updated model is then used (with the stored variables) during operation.

Although an example training method has been described, any method of training a CNN may be used. Further steps such as dropout for model regularization may be included for example.

Figure 12 shows a plot of the training accuracy () and Figure 13 shows a plot of the loss for 100 epochs of training the 11 layer architecture shown in the Figure 3(c). In both images, one curve is for the training set predictions, the other curve for the validation set predictions. The term "epoch" means an entire pass through the training data.

Figure 14 shows an example evolution of the first 8 filters in the first convolutional layer at different stages during training. Each subplot shows a Hinton diagram where each weight is represented by a shaded square of varying size. The area of each square is proportional to the weight’s magnitude, and the sign (negative or positive) of each weight is indicated in black or white respectively. The input time series (real power, reactive power, thetal and harmonics in this case) and the respective weights for each kernel for each of the first 8 filters in the first layer are shown. The receptive field for the filters is 3.

From the plot, it can be seen that the learning algorithm is able to automatically learn features corresponding to a range of electrical phenomena, for example the first filter shows anti-correlated oscillations in real vs reactive power (whereby the weights are large and have opposite sign for the real and reactive power samples), while the fifth filter learns a feature for correlated drops in real and reactive power (whereby the weights have the same sign).

During training, an optional step of data augmentation may be included. In order to improve training and predictive performance for deep learning systems large amounts of training data are used. A step of data augmentation may be included to enrich a training dataset and achieve improved predictive performance. Data augmentation is not performed during operation.

Data augmentation involves creating artificial additional training examples by copying real training examples and modifying them. One or more of the following data augmentation techniques may be applied (i.e. alone or in combination) to create additional training examples.

Time variation based techniques may be used, for example jittering or time warping.

Jittering modifies the training examples by direction to produce new examples. For example, for a relatively scarce class such as an event from a certain power tool or a microwave model with an unusual magnetron, it may be desirable to create artificial versions of the data. The term class is used to refer to a particular appliance state change (for example “kettle off to on”). Some more complex appliances such as washing machines and dishwashers can be approximately modelled as comprising two or more independent elements that have their own binary on/off states. Device classes corresponding to these types of devices may have multiple“sub-classes”. Additional events are created by copying genuine events and then shifting the time location by translation to the left or right within the buffer by a random or predetermined amount (for example +/- 10%). Figure 15(a) shows an example of an event that had a left starting position in the buffer, that was then jittered 7.6% rightwards in the buffer, corresponding to 19 mains cycles.

Time warping stretches or squashes events by a random or pre-determined amount, for example +/- 10%.

Alternatively or additionally, amplitude variation based techniques may be used, for example varying the amplitude of one or more of the time series by a pre-determined or random amount.

Alternatively or additional, random or cyclic noise may be added to one or more time series.

Figure 15(b) shows an example method of data augmentation that may be used when training the CNN.

In S1 , a target count (i.e. number of training data points) is determined per class. In S2, the number of augmented training examples to be created from each original training example is selected for each training example in the class. To create each augmented example, a copy of the original training example is made in S3, and the set of augmentation operations is selected in S4. The augmentation operations may be jittering, stretching, warping, amplitude variation and/or noise addition for example. For each operation, in S5, the parameters are determined in S5 (which may be random, semi-random or deterministic) and the augmentation is applied to the training example copy in S6. The newly created training example is then saved in S7. These steps are repeated for each class.

In one embodiment, the output of the CNN (i.e. the event probability values) is input into a tracking system.

In one embodiment, the tracking system is a Factorial Hidden Markov Model (FHMM) based tracking system. Figure 16 shows a flow chart illustrating a method using a FHMM based system. In an alternative embodiment, the tracking system may be an LSTM based tracking system. An LSTM is another type of neural network architecture that can take sequences of arbitrary length as input, and produce a variety of outputs, including fixed length outputs, series outputs of varying length, and multiple series outputs.

In both cases, the CNN is trained initially and then the tracking system is trained in a subsequent step. Using a trained CNN as input to an LSTM tracking system can allow the CNN to continue to adapt its parameters based on the final system performance during the training of the LSTM as well.

Figure 16 shows a flow chart illustrating an example method performed by an FHMM based tracking system. The FHMM based tracking system takes as input a list of appliance state changes and the corresponding event probability values generated by the CNN based classifier for a first portion of the signal. In this example, the first portions are extracted using a step detection method and comprise a change in magnitude of the signal.

Step S901 comprises inputting power step size and classifier output probabilities for a detected event or combination of sub-events occurring at time-step t. Time step t is taken to be a single time step corresponding to the first portion of the signal. Thus whereas sub-events may occur at a different time within the first portion, when input to S901 they are treated as having occurred at the same, single time step. An event, and possibly one or more combinations of sub-events are input for the time step, with the associated parameters being (1) the power step magnitude (e.g. change in real and/or apparent power over the event), and (2) the event probability values, i.e. classifier output probabilities for each device state change. Further associated information may be inputted.

The method thus processes the utility consumption data at discrete time points, each corresponding to an event. The time points are irregular, and the method only processes data at time points at which an event has been observed.

In S902, any classes or sub-classes, i.e. appliance or component state changes, with event probability values below a threshold value may be excluded. The remaining probability values are then re-normalised. In an embodiment, the probability threshold is a relative probability threshold. In an embodiment, the threshold value is 20% of the highest event probability value.

Each appliance d in the unit is assumed to have a limited number of discrete states m_d. Determining the state at time t allows disaggregation of the appliance.

S903 comprises, for each of one or more existing“house state” models, determining a survival probability value and/or a hazard rate. A“house state” model comprises stored information indicating a state for each appliance and indicating a state duration for the appliance if in an active state. The existing house state models are the house state models which were generated at the previous event detection time step, i.e. at time step t-1. If no house state models were generated at a previous event detection, a single“background” house state is used as the existing house state, i.e. in which all of the devices are in the off state.

A house state can be thought of as the combined states of each device in the house. Although the term“house” is used, as discussed previously, the house state model relates to the combined states of each appliance in any kind of unit (e.g. commercial or residential building) or each appliance attached to a monitoring line or smart plug or smart meter type device.

The number of possible house states is therefore P d^devices ^md· For example, in a house with only two devices, both of which only have two possible states (on and off) there are a total of 4 possible house states. However in a real situation, there may be in the region of 20 devices that are being kept track of, which have in the region of 2-5 modelled device states each. A stored model of the available devices and their possible states is used to generate the house state models, where the house state model comprises information indicating the state (and state duration if in an active state) of each of the available devices. The stored model thus may comprise a list of devices, e.g. dishwasher, washing machine, etc., and the possible states for these devices. For some types of devices, the stored model will have only one of the device, e.g. dishwasher. For others, e.g. hobs, the stored model will have multiple devices. The stored model may be based on use data about the actual devices in the unit for example, or based on a standard model of the typical devices. The stored model thus determines the number of possible house states, since it comprises a set number of available devices, each with a set number of possible states.

The house state survival probability value is calculated from a device state survival probability value calculated for each device which is in an active state in the existing house state model. The device state survival probability value is the probability that the device remains in the state given the state duration and the time of the new event, i.e. time step t. Thus for each house-state from time step t-1 , the survival probability is calculated by taking the product of the survival probabilities for each of its active devices, i.e. the probability of the device continuing in the active state until time t. Alternatively, or additionally a hazard rate value may be generated and used instead of the survival probability value.

Thus in this step, for each existing House State hs_t-i in the model’s previous time step t-1 , the method calculates/updates the combined survival probability of the active appliance’s states. For house states in which more than one device is in an active state, the total survival probability is the product of each individual device’s survival probability.

A duration based switching probability model generated prior to implementation using the hazard rate (also known as Force of Mortality) from survival analysis may be used to determine the survival probability for each device state.

Prior to implementation, a transition probability model is generated for each device active state and stored in the system. This model comprises the variation of the probability that the device transitions from the active state into another state with the duration of time since the device transitioned into the active state. The model may comprise a transition probability value for each discrete time duration value. Details of how the model may be generated are given in relation to Figure 17 below, which relates to the training stage.

In an embodiment, in S903, for each device in an active state in each existing house state model, the transition probability at the time t is determined from the stored model. The duration is calculated from the time t and the state duration for the device in the existing house state model, by taking the state duration and adding the time between time step t-1 and time step t. The survival probability is calculated from the transition probability. The survival probability is simply the probability that none of the transitions took place, i.e. one minus the probability of transition given the current active state duration. The house state survival probability value is calculated by taking the product of each of the device state survival probability values for the house state.

For example, if the existing house state of 100W has persisted for a very long period (say 20 minutes), this is much longer than any device is expected to be in a rest state for (excluding fridges, which may be dealt with by a separate algorithm), thus any new house states featuring active devices in a rest state will be assigned zero probability and excluded in S903.

In S904, these house state survival probability values are normalised, and any existing house states, i.e. house states at t-1 , whose survival probabilities are below a threshold value are excluded. In an embodiment, the threshold value is 2% of the highest survival probability value of the existing house states. Thus any candidate house state(s) for which one or more appliance has been in its current state longer than the stored model suggests is likely, are excluded.

Thus a set of existing house state models is produced. Although in this method, various tests have been performed to remove unlikely device state changes and existing house state models, alternatively all device state changes and existing house state models may be automatically included.

The following steps S905 to S908 are performed for each existing house state hsu .

For each existing house state, the following steps S905 to S908 are performed for each possible device state change or combination of device state changes. As described above, some of the device state change(s) may have been excluded in S902 for example. In an embodiment, the heating device superclass is also replaced by the individual devices at this stage.

In these steps, new house states are generated by updating the existing house state models to reflect each allowed change in state of one or more of the devices. In step S905 the allowed changes in state and thus the allowed new house states are determined according to a stored set of rules and a stored model.

Thus rather than using a transition matrix that holds the probabilities of each house state transitioning to any other house state, only transitions that match certain criteria in combination with the stored model are considered and generated in S905. The criteria or rules may comprise one or more of the following:

• Only a single device may change state at any one time step (where the input to the method comprises information about sub-events, consideration of more than one device switching for a single time step may be allowed where certain further criteria are met, for example, that one or more of the devices is known to be in the house with a high probability);.

• Only house state models having a total signal magnitude less than the current total signal value of the utility signal may be generated;

• Devices can only change state at the time that an event was detected in the power stream, i.e. at time step t;

• The only state changes allowed are those expected to create power observations of the same sign as that observed;

• The only state changes allowed are those expected to create power observations of the same approximate magnitude as that observed (the expected observations can be determined from a learned signal difference probability distribution generated during training, as described in relation to Figure 17 below);

• The only state changes allowed are those into a different state to the existing state (i.e. if a device is already on in the existing house state, it cannot then turn on) according to the stored model of the device state changes and allowed transitions.

These criteria are used in combination with the stored model of the available devices and the type of transitions allowed for each type of device. In the stored model, most devices are limited to one instance, however some, such as electric hobs and halogen lighting may be allowed multiple instances. The model may be based on user data for example. The criteria specify that only new house states with permitted device transitions and states according to the stored model are generated. Further criteria, such as those based on the magnitude of the signal may also be included.

In the stored model, some devices may have multiple state changes. For example, a washing machine can be modelled as having two dominant components, its heating element (‘HE’) and its motor (‘M’) which can be modelled as functioning independently and as having their own binary work/rest states. The heating element and motor both vary somewhat predictably throughout a wash cycle, with the former being more active at the beginning and the latter being active at the end, thus there is some dependency, however, once information regarding duration is included they can be treated as functioning independently. This results in four possible states for the washing machine: {off, rest, HE, M} where‘rest’ implies that both HE and M are resting, ΉE’ implies HE is working and M is resting, and so on.

All other transitions are considered to have zero probability and house states which do not meet the criteria are not generated. This equates to having a large and very sparse transition matrix.

Thus in an embodiment, the possible house states are determined based on the magnitude and sign of the power change. Transitions are sought which would be an ‘emission’ or power step similar to the one observed in the event which was received as input. Transitions that would have an emission with a power step of the opposite sign to the observation are automatically excluded, and a new house state corresponding to this combination of the existing house state and device state transition is not created. To examine the suitability of the transmission to the observed power step, a probability is obtained by querying a fitted parametric signal difference probability distribution, generated during training.

For example, if the power consumption in the house at a given time is 1300W and this power level has been reached by a 1200W step-up from a consumption level of 100W (which may be treated as a background level, i.e. no devices are in an active state), then based on the assumptions above, the only house states that we treat as having non-zero probability and thus generate are:

i) those featuring a device capable of‘emitting’ a 1200W stepup;

and ii) those transitioning into an active state, i.e. a state in which the device is both actually in use and also appears on because it is consuming power, for example the “work” state of work/rest devices but not the“rest” state.

In this case, the only possible new house states as determined in S905 are those in which all devices are off except for one, and that one must be a device that we believe could operate around 1200W.

In an embodiment, in S905, only new house states for which the device state change (e.g. on to off) is allowed, given the devices’ state in the existing house state and one more rules are generated.

Whilst the total number of possible house states is equal to P d^devices ^md _> the overwhelming majority of these states and the transitions into them can be modelled as having zero probability at time T in this step, i.e. are not allowed according to the stored criteria and model as the observations do not fit with them, and thus are not generated.

The method then proceeds to S906, in which house states corresponding to the allowed houses states determined in S905 are generated or updated for each combination of existing house state and device state change(s) remaining after S902 and S904.

For a particular existing house state combined with a device state change from S S’, in S906, it is checked whether a house state hsy with appliance d in state S’ and with the same state duration, and with the other appliances in the states in which they were in the existing house state with the same duration, already exists. Such a house state may have been created when steps S905 to S908 were performed for another existing house state and device state change combination, i.e. arriving at the same house state via a different path(s). Such a house state will have an associated total house state probability value P(hs_tj), generated from the information relating to the different path(s) and an associated most likely path argmax^hs^y)) or state history, and the house state probability value of that path max(p(hsi:_t,j)).

If not then a new house state is generated, by updating the existing house state model to reflect the change in state of the or each device. Thus the stored information for the new house state model comprises the stored information for the existing house state model, with the state and state duration for the or each device updated. The state duration is updated to 0, and the state durations of any other active devices in the house state model are updated to take into account the time step. The house state probability value is initially set to zero.

In S907, a new house state probability value for the house state model is calculated from the house state probability value of the existing house state and the event probability value or values corresponding to the change in state of the device or devices. The total house state probability value P(hs_tj) is then updated by adding the new house state probability value to the current total house state probability value, resulting in a new total house state probability value.

The new house state probability value may also be calculated from one or more of a signal difference probability value, which is the probability that the difference in magnitude of the utility consumption signal was caused by the device; a transition probability value, which is the probability that the change in state of the device occurs at the time location of the first portion given the state duration; a prior probability value, which is the probability of device state change relative to the other devices; and a survival probability value and/or hazard rate corresponding to each device which is in an active state in the existing house state, wherein the survival probability value is the probability that the device remains in the state given the state duration. All of these probabilities may be stored values or values generated from stored distributions. These stored values and distributions are generated prior to implementation, from training data.

In an embodiment, the new house state probability P(hs_t, ) is the product of one or more of:

• the total probability of the existing house state hs_{t-i ,} ,_, P(hs_{t-i ,} ,)

• the event probability value(s) of the device(s) P_ciassifier;

• the appropriate prior (either switch-on prior or active state change prior) P_prior;

• the survival probability and/or hazard rate of all active devices from the existing house state hs_{t-i ,} , other than the device or devices which changed state

P Psurvival (other devices); • the transition probability of the device(s) which changed state at this instance in time given the state duration Po_sition;

• the signal difference probability, i.e. the probability that the difference in magnitude of the utility consumption signal was caused by the device, based on the range of possible power steps for this appliance P_{signai difference}-

The transition probability is only included where the device is transitioning from an active state.

Thus for example:

P(Microwave on) — P _ciassifier( icrowave) p _{prj0r sw}^_c^_on(Microwave) P _signal difference(M ICrOWave _ On) P transition_offon (MlCrOWave _ On) and

P(hs_{t, j}) = P(Microwave_on) ^* P(hs_{t-i ,} ,) ^* P P_surviv_ai (other devices)

Where two or more devices change state at the time step, the probability values for each device are included in the product.

In an embodiment, the logarithm of the probability values are used in this step to combine the probabilities, to avoid numerical underflow.

For each house state model, the above probabilities for each path are summed to give the total house state probability. The total house state probability value is thus updated by adding the new house state probability value to the current total house state probability value, resulting in a new total house state probability value.. The house state hs_tj will already have a stored total house state probability value, for example 0 if the house state was generated in S906, and a previously generated value if the house state was already created in S906 (i.e. it was created when steps S905 to S908 were performed for another existing house state and device state change combination). The new house state probability value generated in S907 is summed with this stored total house state probability value to give an updated total house state probability value, which is stored. In an embodiment, the new house state probability value generated in S907 is also compared to the stored most likely path house state probability value in S908. The new house state probability value can be considered as the probability that the house state hs_{t, j} was reached via the house state hs_{t-i ,} , and the device state change. If this path to hs_{t, j} has a higher probability than the previous most likely path to hs_{t, j} , which is stored, then information identifying this path as the most likely path to hs_{t, j} is stored instead. Thus in this step, the most likely path to hs_{t, j} is updated to hsu_, , plus the step from hs_{t-i ,} i to hs_{t, j} (i.e. the device state change(s)) and its probability stored. If it does not have a higher probability, then the most likely path is not updated.

Once these steps have been performed for all of the existing house states and all of the device state changes or combinations of device state changes, there results a set of house state models, with corresponding house state model probability values and most likely path information and probability values.

In an embodiment, these values are then normalised, so that the sum over all the house state probability values is 1. These normalised probability values may be referred to as marginal probability values.

In S910, for each device d, the total probability of it being in each of its allowed states (e.g. {on, off} or {on, rest, off}) at t is obtained. In an embodiment, this is done by, for each state S of the device, taking a sum of the marginal probability of each new house state containing the device d in the state S (e.g.‘on’) at t, and normalising the total probabilities of each state of d so that they sum to one. This gives the probabilities of each state of d at t. This may be the final output from the system to the front end app for example.

Figure 17 is a method of training the FHMM based tracking model, which is performed on the system before implementation. This method generates the stored data, which is used by the model to process the utility consumption data.

The training stage may comprise learning (i) the transition probabilities and (ii) the emission probabilities, which correspond to the signal difference probability. The emission probabilities are the probabilities that a hidden state causes certain observations. In this case, this means the probability that the device being in a certain state causes observations in the utility consumption signal. For example, that the microwave being in the‘on’ state causes a signal change of 1050W to be observed on the power line.

For each device class, i.e. type of device, and optionally each sub-class the following steps are performed:

A training data set is loaded, comprising labelled appliance power consumption data. The data may be low frequency data for example 1 Hz or lower. The power consumption data comprises annotations of which discrete states the appliance was in at each time step. The data may be hand-labelled for example or labelled in a semi automatic manner.

For each active state transition for the device, the following steps S1002 to S1004 are performed in order to generate the transition probability model. An active state transition is a transition from a non-off state, i.e. an active state, into any other state.

In S1002, for the transition S S', for all state pairs S , S' subject to S ¹ S' and S ¹ 'off', the training data set is searched for all of the instances of the device being in state S then changing to state S’. The duration of state S prior to the transition to S’ is extracted from each instance found, and these duration values form an empirical distribution. Thus S1002 comprises extracting an empirical distribution of the device state duration prior to an active state transition.

In S1003 a suitable parametric distribution is fitted to the empirical distribution of the state duration. For example, a gamma distribution may be fitted to the empirical distributions of state durations.

In S1004, this parametric distribution and its cumulative distribution function are stored. These distributions are then used to calculate the state transition and survival probability values during implementation. The distribution is the distribution of the probability of the transition occurring with state duration. Thus prior knowledge of the duration characteristics of types of devices is incorporated into the method through the transition probability. These duration based switching probabilities are learned from a training data set, which may comprise public and/or private household power consumption data. Corresponding state durations are found for each device type (e.g. on-state durations for kettles,‘work’ and‘rest’ state durations for Hobs etc.) and a suitable parametric distribution, such as a student-t or gamma distribution, is fit to this data. This parametric distribution is then used to generate the survival function and/or hazard rate for the given device type and transition probability.

An example of where this is an important factor is in the classification of a Kettle vs. a Dishwasher. Both of these devices feature heating elements which frequently can be rated 3000W. At the point of detection (switch on), there may be no way to distinguish these heating elements from each other. However, kettles rarely boil for more than 5 minutes, whereas dishwasher heating elements remain on for much longer. While the kettle may be the most likely explanation for the observed event at the at the point of detection, at some point after 5 minutes having elapsed, the method will begin to favour the dishwasher as being the explanation for the long duration 3000W power consumption (thus the house states with the dishwasher on will have higher probabilities).

Furthermore, if after 15 minutes of the 3000W heating element being observed on, another 3000W resistive switch-on is seen in the power signal, then the method will determine that a kettle was responsible for this observation with high probability. This is because at the point of this second 3000W event being observed, there is already a high probability that the dishwasher is on, i.e. where the house state or states in which the dishwasher is on have high probability, and the house model assumes there is only one dishwasher in the house (meaning that house state models in which multiple dishwashers are on are not permitted according to the criteria in S905, and therefore another device must be responsible. This corresponds to the branching seen in Figures 12(a) and (b), where, for example, after the Toaster is already on in a given house state, the only transitions that are allowed in response to a second‘on’ observation are those where a non-Toaster device turns on.

The available devices and the types of transitions allowed for each type of device is modelled and the model is stored and used to implement the criteria in S905. This may be done in the training stage, from the training data for example, or may be input by the user. For an on/off device, the only transition that needs to be modelled is the‘on’ ‘off transition. However in a work/rest device, three transitions are modelled:‘work’ ‘rest’,‘rest’ ‘work’ and‘work’ ‘off.

51005 is then performed in order to generate the active state prior probability values. This step comprises recording the total number of active state transitions per unit time for the particular device. In other words, the total number of transitions from a non-off state to any other state for the device observed during a particular time period from each data set in the training set is recorded. Thus the active state change priors can be considered as the relative frequencies of a device changing from an active state. For example, for washing machines in the heating state, the probability of transitioning to the motor state is 52%. This value is determined from extracting the number of washing machine transitions from the heating state to the motor state compared to transitions to other subsequent states in training data. The probability of transitioning from the heating state to the rest state may be 41% and to the off state 7%. These values are stored as the active state prior probabilities.

51006 is performed in order to generate the switch-on prior probability values. This step comprises recording the total number of switch on transitions per unit time for the particular device. In other words, the total number of transitions from an off state to any other state for the device observed during a particular time period from each data set in the training set is recorded. Thus the switch on priors can be considered to be the relative frequencies of the devices switching on, as observed across a range of houses. For example, where the average number of switch on events for a kettle is 5 switch ons per day; for hairdryers is 4 switch ons per day and for toasters is 1 switch on per day, the prior switch on probabilities will be: (Kettle: 0.5, Toaster: 0.1 , Hairdryer: 0.4).

51007 and S1008 are performed in order to generate the emission probabilities, i.e. the signal difference probabilities.

All of the power step values are extracted from the training data set for each state transition for the device. Thus an empirical distribution of the probability against size of power step, or signal difference is generated for each transition type. A suitable parametric distribution is then fitted to the empirical power step data. A Student-t distribution may be used in this step. During implementation, the signal difference probability is determined from this stored signal difference probability model, i.e. by extracting the probability value corresponding to the signal difference value.

In an embodiment, the stored probabilities and models may be updated based on one or more of self-learning from the data inputted during implementation, user feedback and active learning through a user interface (interactive app or website for example). Thus the system may be a user or site specific system, which learns and adapts by online updating of one or more of the stored probability values or distributions based on user feedback, active learning and probabilistic confidences from the tracking system.

The method described above is based on a Bayesian time series approach such as a FHMM. The method is based on determining the latent or“hidden” state of multiple independently functioning devices through time. Each device is modelled as a single latent categorical variable (at a minimum there must be two states:“on” and“off”). In an embodiment, the method is based on a factorial hidden semi-Markov model (FHSMM) with a sparse transition matrix and sparse time points. Within the model, probabilities are passed from the states of one time-step and multiplied by various transition probabilities to determine the state probabilities of the next time-step. This process is named “sum-product-message passing” or “belief propagation”, and is the basic mechanism by which inference is performed.

The model explicitly (i) tracks a specified number of devices in the home, and (ii) models each of these tracked devices as having a number of pre-defined discrete states which consume energy at different power levels. That these states are not directly observable but instead are something to be inferred from the powerline observations is the hidden Markov property of the FHSMM. These states are explicitly hardcoded for each device, and transition functions between them learned. This way of modelling multiple independently functioning devices is the factorial property of the FHSMM.

The method generates the probabilities of states surviving or changing by referring to the stored survival curve of a learned parametric distribution which models a device’s chance of changing state given the duration that it has been in that active state. This mechanism allows the model, in an embodiment, to only be updated at time points when events are passed into the model. Events are may be sparse and irregular. Only updating the model when events are observed makes a computational cost saving versus updating at a regular time interval, e.g. once a second.

In an embodiment, the model can also be queried when no events have been observed. This is useful for checking whether the most likely house state and explanation for the observed energy patterns has changed, i.e. whether house state models still survive given the duration of continuing steady states. In an embodiment, if the device has a heavy computational load it can refrain from doing this.

The probability of a modelled device transitioning from an active state to any other state is only semi-Markov, as it is governed by a transition function, rather than a constant transition probability. The transition function determines the probability of transition given the length of time the device has been in the active state. These transition functions are pre-trained from a large corpus of data and stored.

Each house state is a combination of each allowed set of states of each device in the house according to the stored model and criteria. As the number of devices modelled rises, the number of possible house states rises exponentially. Furthermore, the semi- Markov property means that each house state records the duration(s) that each device has been in its current state, and this increases the storage space required to store the possible states to be even larger.

The method comprises tracking of appliance activity using a bespoke probabilistic model that (i) makes hard-coded assumptions about the devices present in an as yet unobserved residence and their behaviour, based on a training data set. These strong assumptions achieve tractability on a system with constrained resources, such as an loT device.

The method allows processing to be performed only when an event is detected, i.e. with sparse time steps and takes into account transition probabilities generated from a reliability/hazard function, thus analysing the state duration as a feature. Although training and operation using a particular FHMM based model has been described, any FHMM based or other tracking model may be used in conjunction with the CNN. The above described methods are computer-implemented methods for monitoring power consumption of one or more electrical appliances or components, and comprise steps of monitoring power consumption of one or more electrical appliances or components by identifying a power signature of the one or more electrical appliances or components from a power consumption signal. The methods may further comprise steps of controlling power consumption of the appliances or components.

While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed the novel methods and apparatus described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and apparatus described herein may be made.

Claims

CLAIMS:

1. A power consumption signal processing system for monitoring power consumption, the system comprising:

2. The system according to claim 1 , wherein the power consumption signal has a fundamental frequency and wherein frequency component corresponds to one of the fundamental or harmonic frequencies.

3. The system according to claim 2, wherein at least one of the characteristics relates to the amplitude of two or more harmonic frequency components.

4. The system according to claim 2 or 3, wherein at least one of the characteristics comprises the amplitude of a harmonic frequency component.

5. The system according to any of claims 2 to 4, wherein at least one of the characteristics comprises the root mean square of two or more even harmonic components.

6. The system according to any of claims 2 to 5, wherein the power consumption signal comprises a current signal and a voltage signal and wherein at least one of the characteristics corresponding to at least one of the fundamental or harmonic frequencies comprises the phase angle of the current relative to the voltage.

7. The system according to any preceding claim, wherein the characteristics of the utility consumption signal further comprise one or more of: real power, apparent power and reactive power.

8. The system according to claim 1 , wherein the power consumption signal comprises a current signal and a voltage signal and wherein the characteristics of the utility consumption signal comprise the real power, the root mean square of the magnitude of all harmonic frequency components, and the phase angle of the current relative to the voltage at the fundamental frequency.

9. The system according to any preceding claim, wherein the processor is further configured to identify the first portion of the input signal by detecting a magnitude change and identifying the first portion of the input signal as a portion comprising the detected magnitude change.

10. The system according to any preceding claim, wherein the processor is further configured to convert the time series to a standard length.

11. The system according to any preceding claim, wherein the processor is further configured to normalise the time series using parameters obtained prior to implementation from training data.

12. The system according to any preceding claim, wherein the processor is further configured to identify a power signature of one or more electrical devices or components based on the event probability values and monitor power consumption of the one or more electrical appliances or components.

13. The system according to claim 12, wherein identifying the power signature and monitoring the power consumption comprises using a tracking system.

14. The system according to claim 13, wherein the tracking system is an FHMM based tracking system or an LSTM based tracking system.

15. The system according to any preceding claim, wherein the first sampling frequency is greater than or equal to 1 kHz.

16. A method of processing a power consumption signal for monitoring power consumption, comprising:

17. A method of training a power consumption signal processing system to monitor power consumption, the method comprising:

training the classifier using the labels to improve the performance.

18. The method according to claim 17, further comprising:

generating further training data from the training dataset by generating modified copies of the training data.

19. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 16.

20. A carrier medium comprising computer readable code configured to cause a computer to perform the method of any of claims 17 to 18.