WO2018171899A1 - Neural network data processing apparatus and method - Google Patents

Neural network data processing apparatus and method

Info

Publication number
WO2018171899A1
Authority
WO
WIPO (PCT)
Prior art keywords
array
data values
input data
neural network
position dependent
Prior art date
Application number
PCT/EP2017/057088
Other languages
French (fr)
Inventor
Jacek Konieczny
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2017/057088
Priority to EP17713634.8 (EP3590076A1)
Priority to CN201780088904.6 (CN110462637B)
Publication of WO2018171899A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • the present invention relates to the field of machine learning or deep learning based on neural networks. More specifically, the present invention relates to a neural network data processing apparatus and method, in particular for processing data in the fields of audio processing, computer vision, image or video processing, classification, detection and/or recognition.
  • BACKGROUND Weighted aggregation, which is commonly used in many signal processing applications, such as image processing methods for image quality improvement, depth or disparity estimation and many other applications [Kaiming He, Jian Sun, Xiaoou Tang, "Guided Image Filtering", ECCV 2010], is a process in which input data are combined to pack information present in a larger spatial area into one single spatial position, with additional input in the form of aggregation weights that control the influence of each input data value on the result.
  • In deep learning, a common approach recently used in many application fields is the utilization of convolutional neural networks. Generally, a specific part of such convolutional neural networks is at least one convolution (or convolutional) layer, which performs a convolution of input data values with a learned kernel K, producing one output data value per convolution kernel for each output position [J. Long, E. Shelhamer, T. Darrell, "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015].
  • the convolution using the learned kernel K can be expressed mathematically as follows:

    out(x, y) = Σ_{i=-r}^{r} Σ_{j=-r}^{r} K(i, j) · in(x - i, y - j) + B,

    wherein out(x, y) denotes the array of output data values, in(x - i, y - j) denotes a sub-array of an array of input data values and K(i, j) denotes the kernel comprising an array of kernel weights or kernel values of size (2r+1)×(2r+1). B denotes an optional learned bias term, which can be added for obtaining each output data value.
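  • Purely for illustration (this sketch is not part of the patent text), the convolution above can be expressed in Python with NumPy; the function name and the border handling (only positions with a full neighbourhood are computed) are choices of this example:

```python
import numpy as np

def conv2d(inp, K, B=0.0):
    """Position-independent convolution:
    out(x, y) = sum_{i,j=-r..r} K(i, j) * in(x - i, y - j) + B.
    Border positions without a full (2r+1)x(2r+1) neighbourhood are
    skipped, so the output is smaller than the input."""
    r = K.shape[0] // 2
    H, W = inp.shape
    out = np.zeros((H - 2 * r, W - 2 * r))
    for x in range(r, H - r):
        for y in range(r, W - r):
            acc = 0.0
            for i in range(-r, r + 1):
                for j in range(-r, r + 1):
                    acc += K[i + r, j + r] * inp[x - i, y - j]
            out[x - r, y - r] = acc + B
    return out
```

  • With a 3×3 all-ones kernel, every output value is simply the sum over the corresponding 3×3 neighbourhood plus the bias B.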
  • the weights of the kernel K are the same for the whole array of input data values in(x, y) and are generally learned during a learning phase of the neural network, which, in the case of 1st-order methods, consists of iteratively back-propagating the gradients of the neural network output back to the input layers and updating the weights of all the network layers by the partial derivatives computed in this way.
  • embodiments of the invention provide a new approach for weighted aggregation of data within a neural network.
  • the neural network layer can compute aggregated data using individual aggregation weights that are learned for each individual spatial position.
  • Aggregation weights can be computed as a function of similarity features and learned weight kernels, resulting in individual aggregation weights for each output spatial position.
  • the invention relates to a data processing apparatus comprising one or more processors configured to provide a neural network.
  • the data to be processed by the data processing apparatus can be, for instance, two-dimensional image or video data or one-dimensional audio data.
  • the neural network provided by the one or more processors of the data processing apparatus comprises a neural network layer being configured to process an array of input data values, such as a two-dimensional array of input data values in(x, y), into an array of output data values, such as a two-dimensional array of output data values out(x, y).
  • the neural network layer can be a first layer or an intermediate layer of the neural network.
  • the array of input data values can be one-dimensional (i.e. a vector, e.g. audio or other e.g. temporal sequence), two-dimensional (i.e. a matrix, e.g. an image or other temporal or spatial sequence), or N-dimensional (e.g. any kind of N-dimensional feature array, e.g. provided by a conventional pre-processing or feature extraction and/or by other layers of the neural network).
  • the array of input data values can have one or more channels, e.g. for an RGB image one R-channel, one G-channel and one B-channel, or for a black/white image only one grey-scale or intensity channel.
  • channels can refer to any "feature".
  • the array of input data values can comprise, for instance, two-dimensional RGB or grey scale image or video data representing at least a part of an image, or a one-dimensional audio signal.
  • the array of input data values can be, for instance, any kind of array of features generated by previous layers of the neural network on the basis of an initial, e.g. original array of input data values, e.g. by means of a feature extraction.
  • the neural network layer is configured to generate from the array of input data values the array of output data values on the basis of a plurality of position dependent, i.e. spatially variable kernels and a plurality of different sub-arrays at different positions of the array of input data values.
  • Each kernel comprises a plurality of kernel values or kernel weights.
  • a respective kernel is applied to a respective sub-array of the array of input data values to generate a single output data value.
  • a "position dependent kernel” as used herein means a kernel whose kernel weights depend on the respective position, e.g. (x,y) for two-dimensional arrays, of the sub-array of input data values.
  • the kernel values applied to a first sub-array of the array of input data values can differ from the kernel values of a second kernel applied to a second sub-array of the array of input data values.
  • the position could be a spatial position defined, for instance, by two spatial coordinates.
  • In a one-dimensional array the position could be a temporal position defined, for instance, by a time coordinate.
  • the data processing apparatus allows aggregating the input data in a way that better reflects mutual data similarity, i.e. the resultant output data value is more strongly influenced by input data values that are closer and more similar to the input data at the center position of the kernel. Moreover, the data processing apparatus allows adapting the kernel weights for different spatial positions of the array of input data values. This, in turn, allows, for instance, minimizing the influence of some of the input data values on the result, for instance the input data values that are associated with another part of the scene (as determined by semantic segmentation) or a different object that is being analysed.
  • the neural network comprises at least one additional network layer configured to generate the plurality of position dependent kernels on the basis of an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values.
  • the original array of original input values can be the array of input data values or a different array.
  • the neural network is configured to generate the plurality of position dependent kernels based on a plurality of learned position independent kernels and a plurality of position dependent weights.
  • the position independent kernels can be learned by the neural network and the position dependent weights or similarity features can be computed, for instance, by a further preceding layer of the neural network.
  • This implementation form allows minimizing the amount of data being transferred to the neural network layer in order to obtain the kernel values. This is because the kernel values are not transferred directly, but computed from the plurality of position dependent weights and/or similarity features, substantially reducing the amount of data for each element of the array of output data values. This can minimize the amount of data being stored and transferred by the neural network between the different network layers, which is especially important during the learning process on the basis of the mini-batch approach, as the memory of the data processing apparatus (GPU) is currently the main bottleneck.
  • the neural network is configured to generate a kernel of the plurality of position dependent kernels by adding the learned position independent kernels each weighted by the associated non-learned position dependent weights (i.e. similarity features).
  • This implementation form provides a very efficient representation of the plurality of position dependent kernels using a linear combination of position independent "base kernels”.
  • the plurality of position independent kernels are predetermined or learned, and wherein the neural network comprises at least one additional neural network layer or "conventional" pre-processing layer configured to generate the plurality of position dependent weights (i.e. similarity features) based on an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values.
  • the original array of original input values can be the array of input data values or a different array.
  • the at least one additional neural network layer or "conventional" pre-processing layer can generate the plurality of position dependent weights (i.e. similarity features) using, for instance, bilateral filtering, semantic segmentation, per-instance object detection, and data importance indicators like ROI (region of interest).
  • the array of input data values and the array of output data values are two-dimensional arrays
  • the convolutional neural network layer is configured to generate the plurality of position dependent kernels w_L(x, y, i, j) on the basis of the following equation:

    w_L(x, y, i, j) = Σ_{f=1}^{Nf} F_f(x, y) · K_f(i, j),

    wherein F_f(x, y) denotes the set of Nf position dependent weights (i.e. similarity features) and K_f(i, j) denotes the plurality of position independent kernels.
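  • A minimal sketch of this linear combination in Python/NumPy (the function name and the array layouts are assumptions of this example, not taken from the description): each position (x, y) receives its own kernel as the similarity-feature-weighted sum of the Nf position-independent base kernels.

```python
import numpy as np

def position_dependent_kernels(F, K_base):
    """w_L(x, y, i, j) = sum_f F_f(x, y) * K_f(i, j).
    F: (Nf, H, W) position dependent weights (similarity features).
    K_base: (Nf, k, k) position independent base kernels.
    Returns an (H, W, k, k) array: one kernel per spatial position."""
    return np.einsum('fxy,fij->xyij', F, K_base)
```

  • Only the Nf feature values per position need to be stored instead of a full k×k kernel, which is the data-reduction argument made above.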
  • the neural network layer is a convolutional network layer or an aggregation network layer.
  • the array of input data values and the array of output data values are two-dimensional arrays, wherein the array of input data values in(x, y, c_i) has C_i different channels and wherein the neural network layer is a convolutional network layer configured to generate the array of output data values out(x, y, c_o) on the basis of the following equation:

    out(x, y, c_o) = (1 / W_L(x, y)) · Σ_{c_i=1}^{C_i} Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, c_o, c_i, i, j) · in(x - i, y - j, c_i),

  • wherein r denotes a size of each kernel of the plurality of position dependent kernels w_L(x, y, c_o, c_i, i, j) and W_L(x, y) denotes a normalization factor.
  • the normalization factor can be set equal to 1.
  • the array of input data values and the array of output data values are two-dimensional arrays, wherein the array of input data values in(x, y) has only a single channel and wherein the neural network layer is an aggregation network layer configured to generate the array of output data values out(x, y) on the basis of the following equation:

    out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, i, j) · in(x - i, y - j),

  • wherein r denotes a size of each kernel of the plurality of position dependent kernels w_L(x, y, i, j) and W_L(x, y) denotes a normalization factor.
  • W_L(x, y) can be set equal to 1.
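  • Under the assumption that W_L(x, y) is chosen as the sum of the kernel weights (setting it to 1 disables the normalization, as noted above), the single-channel aggregation layer can be sketched as follows; function name and border handling are choices of this illustration:

```python
import numpy as np

def weighted_aggregation(inp, w):
    """out(x, y) = (1 / W_L(x, y)) * sum_{i,j} w_L(x, y, i, j) * in(x - i, y - j).
    inp: (H, W) single-channel input; w: (H, W, 2r+1, 2r+1) position
    dependent kernels. Border positions are left at zero for brevity."""
    r = w.shape[2] // 2
    H, W = inp.shape
    out = np.zeros_like(inp, dtype=float)
    for x in range(r, H - r):
        for y in range(r, W - r):
            k = w[x, y]
            # flip the patch so patch[i+r, j+r] corresponds to in(x - i, y - j)
            patch = inp[x - r:x + r + 1, y - r:y + r + 1][::-1, ::-1]
            norm = k.sum()
            norm = norm if norm != 0 else 1.0
            out[x, y] = (k * patch).sum() / norm
    return out
```

  • With the normalization in place, a constant input is reproduced exactly at every interior position, which is the property that makes aggregated matching costs comparable across positions.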
  • the neural network layer is a correlation network layer configured to generate the array of output data values from the array of input data values and a further array of input data values by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels, or by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels associated to the array of input data values and a further position dependent kernel of a plurality of further position dependent kernels associated to the further array of input data values.
  • the array of input data values in1(x, y), the further array of input data values in2(x, y) and the plurality of position dependent kernels w_L1(x, y, i, j) are two-dimensional arrays and wherein the correlation neural network layer is configured to generate the array of output data values out(x, y) on the basis of the following equation:

    out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L1(x, y, i, j) · in1(x - i, y - j) · in2(x - i, y - j),

  • wherein W_L(x, y) denotes a normalization factor.
  • W_L(x, y) can be set equal to 1.
  • the array of input data values in1(x, y), the further array of input data values in2(x, y), the plurality of position dependent kernels w_L1(x, y, i, j) and the plurality of further position dependent kernels w_L2(x, y, i, j) are two-dimensional arrays and wherein the correlation neural network layer is configured to generate the array of output data values out(x, y) on the basis of the following equation:

    out(x, y) = (1 / W_L12(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L1(x, y, i, j) · in1(x - i, y - j) · w_L2(x, y, i, j) · in2(x - i, y - j),

  • wherein r denotes a size of each kernel of the plurality of position dependent kernels w_L1(x, y, i, j) and of each kernel of the plurality of further position dependent kernels w_L2(x, y, i, j) and W_L12(x, y) denotes a normalization factor.
  • the normalization factor can be set equal to 1.
  • the neural network layer is configured to generate a respective output data value of the array of output data values by determining a respective input data value of a respective sub-array of input data values of the plurality of sub-arrays of input data values being associated with a maximum or minimum kernel value of a position dependent kernel and using the respective determined input data value as the respective output data value.
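  • One way to read this implementation form (a sketch under the assumption that the maximum is used; the minimum case is symmetric): instead of summing, the layer routes through the single input value at the offset where the position dependent kernel is largest.

```python
import numpy as np

def select_by_max_weight(inp, w):
    """For each interior position, output the input value of the sub-array
    that is associated with the maximum kernel value of the position
    dependent kernel w[x, y] (w has shape (H, W, 2r+1, 2r+1))."""
    r = w.shape[2] // 2
    H, W = inp.shape
    out = np.zeros_like(inp, dtype=float)
    for x in range(r, H - r):
        for y in range(r, W - r):
            # locate the (i, j) offset of the largest kernel weight
            i, j = np.unravel_index(np.argmax(w[x, y]), w[x, y].shape)
            out[x, y] = inp[x + i - r, y + j - r]
    return out
```

  • This behaves like a max-pooling variant in which the pooling choice is driven by the learned, position dependent weights rather than by the input values themselves.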
  • the invention relates to a corresponding data processing method comprising the step of generating by a neural network layer of a neural network from an array of input data values an array of output data values based on a plurality of position dependent kernels and a plurality of sub-arrays of the array of input values.
  • the method comprises the further step of generating the position dependent kernel of the plurality of position dependent kernels by an additional neural network layer of the neural network based on an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values.
  • the step of generating the position dependent kernel of the plurality of position dependent kernels comprises generating the position dependent kernel of the plurality of position dependent kernels based on a plurality of position independent kernels and a plurality of position dependent weights.
  • the step of generating a kernel of the plurality of position dependent kernels comprises the step of adding, i.e. summing, the position independent kernels weighted by the associated position dependent weights.
  • the step of generating the plurality of position dependent weights comprises the step of generating the plurality of position dependent weights by an additional neural network layer or a processing layer of the neural network based on an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values.
  • the array of input data values and the array of output data values are two-dimensional arrays, and the step of generating a kernel of the plurality of position dependent kernels is based on the following equation:

    w_L(x, y, i, j) = Σ_{f=1}^{Nf} F_f(x, y) · K_f(i, j),

  • wherein F_f(x, y) denotes the plurality of Nf position dependent weights (i.e. similarity features) and K_f(i, j) denotes the plurality of position independent kernels.
  • the neural network layer is a convolutional network layer or an aggregation network layer.
  • the array of input data values and the array of output data values are two-dimensional arrays and the neural network layer is a convolutional network layer, wherein the step of generating the array of output data values is based on the following equation:

    out(x, y, c_o) = (1 / W_L(x, y)) · Σ_{c_i=1}^{C_i} Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, c_o, c_i, i, j) · in(x - i, y - j, c_i).
  • the neural network layer is an aggregation network layer and wherein the step of generating the array of output data values is based on the following equation:

    out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, i, j) · in(x - i, y - j).

  • the normalization factors can be set equal to 1.
  • the neural network layer is a correlation network layer and the step of generating the array of output data values comprises generating the array of output data values from the array of input data values and a further array of input data values (a) by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels, or (b) by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels associated to the array of input data values and a further position dependent kernel of a plurality of further position dependent kernels associated to the further array of input data values.
  • the array of input data values, the further array of input data values and the kernel are two-dimensional arrays and the step of generating the array of output data values by the correlation neural network layer is based on the following equations:

    out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L1(x, y, i, j) · in1(x - i, y - j) · in2(x - i, y - j), or

    out(x, y) = (1 / W_L12(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L1(x, y, i, j) · in1(x - i, y - j) · w_L2(x, y, i, j) · in2(x - i, y - j).

  • the normalization factors W_L or W_L12 can be set equal to 1.
  • the step of generating an output data value of the array of output data values by the convolutional neural network layer comprises the steps of determining an input data value of a sub-array of input data values of the plurality of sub-arrays of input data values being associated with a maximum or minimum kernel value of a position dependent kernel and using the determined input data value as the output data value.
  • the invention relates to a computer program comprising program code for performing the method according to the second aspect, when executed on a processor or a computer.
  • the invention can be implemented in hardware and/or software or in any combination thereof.
  • Fig. 1 shows a schematic diagram illustrating a data processing apparatus based on a neural network according to an embodiment
  • Fig. 2 shows a schematic diagram illustrating a neural network provided by a data processing apparatus according to an embodiment
  • Fig. 3 shows a schematic diagram illustrating the concept of down-stepping or aggregation of data implemented in a data processing apparatus according to an embodiment
  • Fig. 4 shows a schematic diagram illustrating different aspects of a neural network provided by a data processing apparatus according to an embodiment
  • Fig. 5 shows a schematic diagram illustrating different aspects of a neural network provided by a data processing apparatus according to an embodiment
  • Fig. 6 shows a schematic diagram illustrating different processing steps of a data processing apparatus according to an embodiment
  • Fig. 7 shows a schematic diagram illustrating a neural network provided by a data processing apparatus according to an embodiment
  • Fig. 8 shows a schematic diagram illustrating different aspects of a neural network provided by a data processing apparatus according to an embodiment
  • Fig. 9 shows a schematic diagram illustrating different processing steps of a data processing apparatus according to an embodiment
  • Fig. 10 shows a flow diagram illustrating a neural network data processing method according to an embodiment.
  • Figure 1 shows a schematic diagram illustrating a data processing apparatus 100 according to an embodiment configured to process data on the basis of a neural network.
  • the data processing apparatus 100 shown in figure 1 comprises a processor 101.
  • the data processing apparatus 100 can be implemented as a distributed data processing apparatus 100 comprising more than the one processor 101 shown in figure 1.
  • the processor 101 of the data processing apparatus 100 is configured to provide a neural network 110.
  • the neural network 110 comprises a neural network layer being configured to generate from an array of input data values an array of output data values based on a plurality of sub-arrays of the array of input data values and a plurality of position dependent kernels comprising a plurality of kernel values or kernel weights.
  • the data processing apparatus 100 can further comprise a memory 103 for storing and/or retrieving the input data values, the output data values and/or the kernel values.
  • Each position dependent kernel comprises a plurality of kernel values or kernel weights. For a respective position or element of the array of input data values a respective kernel is applied to a respective sub-array of the array of input data values to generate a single output data value.
  • a "position dependent kernel” as used herein means a kernel whose kernel weights depend on the respective position of the sub-array of the array of input data values to which the kernel is applied. In other words, for a first kernel applied to a first sub-array of the plurality of input data values the kernel values can differ from the kernel values of a second kernel applied to a second sub-array of the plurality of input data values forming a different sub-array of the same array of input values.
  • the position could be a spatial position defined, for instance, by two spatial coordinates (x,y).
  • the position could be a temporal position defined, for instance, by a time coordinate (t).
  • the array of input data values can be one-dimensional (i.e. a vector, e.g. audio or other e.g. temporal sequence), two-dimensional (i.e. a matrix, e.g. an image or other temporal or spatial sequence), or N-dimensional (e.g. any kind of N-dimensional feature array, e.g. provided by a conventional pre-processing or feature extraction and/or by other layers of the neural network 110).
  • the array of input data values can have one or more channels, e.g. for an RGB image one R-channel, one G-channel and one B-channel, or for a black/white image only one grey-scale or intensity channel.
  • channel can refer to any "feature".
  • the array of input data values can comprise, for instance, two-dimensional RGB or grey scale image or video data representing at least a part of an image, or a one-dimensional audio signal.
  • the array of input data values can be, for instance, an array of similarity features generated by previous layers of the neural network on the basis of an initial, e.g. original array of input data values, e.g. by means of a feature extraction, as will be described in more detail further below.
  • the neural network layer 120 can be implemented as a convolution (or convolutional) layer configured to "mix" all channels of the array of input data values, for instance, in case the array of input data values is an RGB image, i.e. an image having three channels.
  • the position dependent kernels may be channel-specific or common for all channels.
  • the position dependent kernels are generally multi-channel kernels.
  • the neural network layer can be implemented as a correlation layer providing a combination of aggregation or convolution (input image and weighted kernel) and an additional image (i.e. correlation of two, e.g. of the same or at the same position, sub-arrays in the two images with each other and additional application of the position dependent kernel on the correlation result).
  • the position dependent kernels may be channel-specific or common for all channels.
  • FIG. 2 shows a schematic diagram illustrating elements of the neural network 110 provided by the data processing apparatus 100 according to an embodiment.
  • the neural network layer 120 is implemented as a weighted aggregation layer 120.
  • the neural network layer 120 can be implemented as a convolution network layer 120 (also referred to as convolutional network layer 120) or as a correlation network layer 120, as will be described in more detail further below.
  • the aggregation layer 120 is configured to generate a two-dimensional array of output data values out(x, y) 121 on the basis of a respective sub-array of the two-dimensional array of input data values in(x, y) 117 and a plurality of position dependent kernels 118 comprising a plurality of kernel values or kernel weights.
  • the weighted aggregation layer 120 of the neural network 110 shown in figure 2 is configured to generate the array of output data values out(x, y) 121 on the basis of the plurality of sub-arrays of the two-dimensional array of input data values in(x, y) 117 and the plurality of position dependent kernels 118 comprising the kernel values w_L(x, y, i, j) using the following equation:

    out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, i, j) · in(x - i, y - j),

    wherein r denotes a size of each kernel of the plurality of position dependent kernels 118 (in this example, each kernel and each sub-array of the array of input values has a size of (2r+1)×(2r+1)) and W_L(x, y) denotes a normalization factor.
  • the normalization factor can be omitted, i.e. set to one.
  • the normalization factor allows keeping the mean value or DC component. This can be advantageous when the weighted aggregation layer 120 is used to aggregate stereo matching costs of a stereo image, because the normalization is beneficial for making the output values for different sub-arrays of the array of input data values comparable. This is usually not necessary in the case of the convolutional network layer 120.
  • the above equations for a two-dimensional input array and a kernel having a quadratic shape can be easily adapted to the case of an array of input values 117 having one dimension or more than two dimensions and/or a kernel having a rectangular shape, e.g. a non-square rectangular shape with different horizontal and vertical dimensions.
  • the neural network layer 120 can be configured to generate the array of output data values out(x, y, c_o) 121 having one or more channels on the basis of the plurality of sub-arrays of the two-dimensional array of input data values in(x, y, c_i) 117 in the different channels and the plurality of position dependent kernels 118 comprising the kernel values w_L(x, y, c_o, c_i, i, j) using the following equation:

    out(x, y, c_o) = (1 / W_L(x, y)) · Σ_{c_i=1}^{C_i} Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, c_o, c_i, i, j) · in(x - i, y - j, c_i),

    wherein C_i denotes the number of channels of the array of input data values 117.
  • the neural network layer 120 is configured to generate the array of output data values 121 with a smaller size than the array of input data values 117.
  • the neural network 110 is configured to perform a down-step operation on the basis of the plurality of position dependent kernels 118.
  • Figure 3 illustrates a down-step operation provided by the neural network 110 of the data processing apparatus 100 according to an embodiment.
  • Using a down-step operation allows increasing the receptive field, enables processing the data with a cascade of smaller filters as compared with a single layer with a kernel covering an equal receptive field, and also enables the neural network 110 to better analyze the data by finding more sophisticated relationships among the data and adding more non-linearities to the processing chain by separating each convolution layer with a non-linear element like a sigmoid or a Rectified Linear Unit (ReLU).
  • the neural network layer 120 can combine the input data values to produce the array of output data values with a reduced resolution. This can be achieved by convolving the array of input data values 117 with the position dependent kernels 118 with a stride S greater than 1.
  • the stride S specifies the spacing between neighboring input spatial positions for which convolutions are computed. If the stride S is equal to 1, the convolution is performed for each spatial position. If the stride S is greater than 1, the neural network layer 120 is configured to perform a convolution for every S-th spatial position of the array of input data values 117, thereby reducing the output resolution (i.e. the dimensions of the array of output data values 121) by a factor of S for each spatial dimension.
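  • The effect of the stride on the sampled positions can be illustrated as follows (the helper name and the "valid" border convention, i.e. evaluating only positions with a full neighbourhood, are assumptions of this sketch):

```python
def strided_positions(H, W, r, S):
    """Spatial positions at which a (2r+1)x(2r+1) kernel is evaluated on
    an H x W input with stride S: every S-th valid position per dimension,
    so each output dimension shrinks roughly by a factor of S."""
    return [(x, y)
            for x in range(r, H - r, S)
            for y in range(r, W - r, S)]
```

  • For an 8×8 input with r = 1, stride 1 yields 36 positions, while stride 2 yields only 9.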
  • the horizontal and the vertical stride may be the same or different.
  • the neural network layer 120 combines the array of input data values 117 from the spatial area of size (2r+1)×(2r+1) to produce a respective output data value of the array of output data values 121.
  • the input data values 117 can be aggregated to pack information present in a larger spatial area into one single spatial position.
  • the neural network 110 comprises one or more preceding layers 115 preceding the neural network layer 120 and one or more following layers 125 following the neural network layer 120.
  • the neural network layer 120 could be the first and/or the last data processing layer of the neural network 110, i.e. in an embodiment there could be no preceding layers 115 and/or no following layers 125.
  • the one or more preceding layers 115 can be further neural network layers and/or "conventional" pre-processing layers, such as a feature extraction layer.
  • the one or more following layers 125 can be further neural network layers, such as a deconvolutional layer, and/or "conventional" post-processing layers.
  • one or more of the preceding layers 115 can be configured to provide, i.e. to generate the plurality of position dependent kernels 118 (see the bottom signal path of the preceding layers 115 from guiding data 113 to the position dependent kernels 118 in Fig. 2).
  • the one or more layers of the preceding layers 115 can generate the plurality of position dependent kernels 118 on the basis of an original array 111 of original input data values, e.g. an original image as a 2D example.
  • the original array 111 of original input data values can be an array of input data 111 being the original input of the neural network 110.
  • the one or more preceding layers 115 could be configured to generate just the plurality of position dependent kernels 118 on the basis of the original input data 111 of the neural network 110 and to provide the original input data 111 of the neural network 110 as the array of input data values 117 to the neural network layer 120 (no preceding layers in the top signal path of the preceding layers 115 from the original input array 111 to the input array 117 according to an embodiment of the neural network layer 120, see Fig. 2).
  • the original array 111 may form the input array 117.
  • the one or more preceding layers 1 15 of the neural network 1 10 are configured to generate the plurality of position dependent kernels 1 18 on the basis of an array of guiding data 1 13.
  • a more detailed view of the processing steps of the neural network 110 of the data processing apparatus 100 according to such an embodiment is shown in figure 4 for the exemplary case of two-dimensional input and output arrays.
  • the array of guiding data 113 is used by the one or more preceding layers 115 of the neural network 110 to generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113.
  • the neural network layer 120 is configured to generate the two-dimensional array of output data values out(x, y) 121 on the basis of the two-dimensional array of input data values in(x, y) 117 and the plurality of position dependent kernels wL(x, y) 118, which, in turn, are based on the array of guiding data g(x, y) 113.
  • the one or more preceding layers 115 of the neural network 110 are neural network layers configured to learn the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113.
  • the one or more preceding layers 115 of the neural network 110 are pre-processing layers configured to generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data 113 using one or more pre-processing schemes, such as feature extraction.
  • the one or more preceding layers 115 of the neural network 110 are configured to generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113 in a way analogous to bilateral filtering, as illustrated in figure 5.
  • Bilateral filtering is known in the field of image processing for performing a weighted aggregation of input data, while decreasing the influence of some input values and amplifying the influence of other input values on the aggregation result [M. Elad, "On the origin of bilateral filter and ways to improve it", IEEE Transactions on Image Processing, vol. 11, no. 10, pp. 1141-1151, October 2002].
  • the weights 518 utilized for aggregating the array of input data values 517 adapt to the input data 517 using the guiding image data g 513, which provides additional information to control the aggregation process.
  • the array of guiding image data 513 can be equal to the array of input data values 517 used for generating the array of output data values 521 by the layer 520 on the basis of the weights 518.
  • the bilateral filter weights 518 take into consideration the distance of the value within the kernel from the center of the kernel and, additionally, the similarity of the data values with the data in the center of the kernel, as mathematically described by the following equation:

    out(x, y) = 1/W(x, y) · Σ_{i=-r..r} Σ_{j=-r..r} w(x, y, i, j) · in(x − i, y − j),

    wherein the normalization factor W(x, y) is based on the following equation:

    W(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} w(x, y, i, j).

  • the bilateral filter weights 518 are defined by the following equation:

    w(x, y, i, j) = exp(−(i² + j²)/(2σs²) − d(g(x, y), g(x − i, y − j))²/(2σr²)),

    wherein d(.,.) denotes a distance function.
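The bilateral weighting described above can be sketched as follows. This is an illustrative example, not part of the embodiment: the function name, the step-edge guiding image and the Gaussian spatial/range terms with parameters sigma_s and sigma_r are assumptions chosen to match the usual bilateral filter formulation.

```python
import numpy as np

def bilateral_weights(g, x, y, r=1, sigma_s=1.0, sigma_r=0.1):
    """Position dependent weights for the kernel centered at (x, y).

    g       -- 2D guiding image (hypothetical example data)
    r       -- kernel radius; the kernel size is (2r+1) x (2r+1)
    sigma_s -- spatial fall-off (distance from the kernel center)
    sigma_r -- range fall-off (similarity to the center value)
    """
    w = np.zeros((2 * r + 1, 2 * r + 1))
    for i in range(-r, r + 1):
        for j in range(-r, r + 1):
            spatial = (i * i + j * j) / (2.0 * sigma_s ** 2)
            rng = (g[x - i, y - j] - g[x, y]) ** 2 / (2.0 * sigma_r ** 2)
            w[i + r, j + r] = np.exp(-(spatial + rng))
    return w / w.sum()   # normalization factor W(x, y)

g = np.ones((5, 5))
g[:, 3:] = 0.0                       # a step edge in the guiding image
w = bilateral_weights(g, 2, 2, r=1)  # kernel straddling the edge
```

Neighbors on the far side of the edge receive near-zero weight, so the aggregation is effectively restricted to the same region as the center value.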
  • Figure 6 shows a schematic diagram highlighting the main processing stage 601 of the data processing apparatus 100 according to an embodiment, for instance, the data processing apparatus 100 providing the neural network 110 shown in figure 2.
  • the neural network 110 in a first processing step 603 can generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113.
  • in a second processing step 605 the neural network 110 can generate the array of output data values out(x, y) 121 on the basis of the array of input data values in(x, y) 117 and the plurality of position dependent kernels wL(x, y) 118.
  • Figure 7 shows a schematic diagram illustrating the neural network 110 provided by the data processing apparatus 100 according to a further embodiment.
  • the neural network 110 is configured to generate the plurality of position dependent kernels based on a plurality of position independent kernels 119b (shown in figure 8) and a plurality of position dependent weights Ff(x, y) 119a (also referred to as similarity features 119a).
  • the similarity features 119a are obtained based on the guiding data 113 and could indicate higher-level knowledge about the input data 111, including e.g. information about object or region borders.
  • the neural network 110 of figure 7 is configured to generate the plurality of position dependent kernels 118 by adding the position independent kernels 119b, each weighted by the associated position dependent weights Ff(x, y) 119a.
  • any one of the plurality of position independent kernels 119b can be a predetermined or a learned kernel.
  • the neural network 110 can comprise one or more preceding layers 115, which precede the neural network layer 120 and which can be implemented as an additional neural network layer or a pre-processing layer.
  • one or more layers of the preceding layers 115 are configured to generate the plurality of position dependent weights Ff(x, y) 119a on the basis of an original array of original input data values or the guiding data 113.
  • the original array of original input data values of the neural network 110 can comprise the array of input data values 117 to be processed by the neural network layer 120 or another array of input data values 111 associated to the array of input data values 117, for instance, the initial or original array of input data 111.
  • the array of input data values in(x, y) 117 and the array of output data values out(x, y) 121 are two-dimensional arrays and the neural network layer 120 is configured to generate a respective kernel of the plurality of position dependent kernels wL(x, y, i, j) 118 on the basis of the following equation:

    wL(x, y, i, j) = Σ_{f=1..Nf} Ff(x, y) · Kf(i, j),

    wherein Ff(x, y) denotes the set of Nf position dependent weights (or similarity features) 119a and Kf denotes the plurality of position independent kernels 119b, as also illustrated in figure 8.
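The composition of position dependent kernels from similarity features Ff(x, y) and position independent base kernels Kf can be sketched as below. This is a minimal numpy illustration; the shapes, the random example data and the function name are hypothetical.

```python
import numpy as np

def position_dependent_kernels(F, K):
    """Combine N_f position independent kernels K_f with position
    dependent weights F_f(x, y) into one kernel per position.

    F -- similarity features, shape (N_f, H, W)
    K -- position independent kernels, shape (N_f, 2r+1, 2r+1)
    Returns w_L with shape (H, W, 2r+1, 2r+1).
    """
    # w_L(x, y, i, j) = sum_f F_f(x, y) * K_f(i, j)
    return np.einsum('fxy,fij->xyij', F, K)

rng = np.random.default_rng(0)
F = rng.random((4, 8, 8))   # N_f = 4 similarity features on an 8x8 grid
K = rng.random((4, 3, 3))   # four learned 3x3 base kernels
wL = position_dependent_kernels(F, K)
```

Only the Nf feature maps and the Nf base kernels need to be stored, rather than a full (2r+1)×(2r+1) kernel per output position.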
  • Figure 9 shows a schematic diagram highlighting the main processing stage 901 of the data processing apparatus 100 according to an embodiment, for instance, the data processing apparatus 100 providing the neural network 110 illustrated in figures 7 and 8.
  • the neural network 110 in a first processing step 903 can generate the plurality of position dependent weights or similarity features Ff(x, y) 119a on the basis of the array of guiding data g(x, y) 113.
  • the neural network 110 can generate the plurality of position dependent kernels wL(x, y, i, j) 118 on the basis of the plurality of position dependent weights or similarity features Ff(x, y) 119a and the plurality of position independent kernels Kf 119b.
  • the neural network layer 120 can generate the array of output data values out(x, y) 121 on the basis of the array of input data values in(x, y) 117 and the plurality of position dependent kernels wL(x, y, i, j) 118.
  • the neural network layer 120 of the neural network 110 can be implemented in the form of a correlation network layer 120 configured to generate the array of output data values 121 from the array of input data values 117 and a further array of input data values by correlating the array of input data values 117 with the further array of input data values and by applying a respective position dependent kernel of the plurality of position dependent kernels 118 to a respective sub-array of the array of input data values 117 and a corresponding sub-array of the further array of input data values.
  • the correlation neural network layer 120 can be configured to generate the array of output data values out(x, y) 121 on the basis of the following equation:

    out(x, y) = 1/WL1(x, y) · Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j) · in1(x − i, y − j) · in2(x − i, y − j),

    wherein in1(x − i, y − j) denotes the array of input data values 117, in2(x − i, y − j) denotes the further array of input data values, wL1(x, y, i, j) denotes the plurality of position dependent kernels 118 and r denotes the size of each kernel of the plurality of position dependent kernels 118 (in this example, each kernel has (2r+1)·(2r+1) kernel values).
  • the output data values 121 can be normalized using the following normalization factor:

    WL1(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j).

  • the normalization factor can be omitted, i.e. set to one.
  • the above equations for a two-dimensional input array and a kernel having a quadratic shape can easily be adapted to the case of an array of input values 117 having one dimension or more than two dimensions and/or a kernel having a non-square rectangular shape, i.e. different horizontal and vertical dimensions.
  • the correlation network layer 120 is configured to generate the array of output data values 121 from the array of input data values 117 and the further array of input data values by correlating the array of input data values 117 with the further array of input data values and by applying a respective position dependent kernel of the plurality of position dependent kernels 118 associated to the array of input data values 117 and a further position dependent kernel of a plurality of further position dependent kernels associated to the further array of input data values to a respective sub-array of the array of input data values 117 and a corresponding sub-array of the further array of input data values.
  • the correlation neural network layer 120 can be configured to generate the array of output data values out(x, y) 121 on the basis of the following equation:

    out(x, y) = 1/WL12(x, y) · Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j) · in1(x − i, y − j) · wL2(x, y, i, j) · in2(x − i, y − j),

    wherein in1(x − i, y − j) denotes the array of input data values 117, in2(x − i, y − j) denotes the further array of input data values, wL1(x, y, i, j) denotes the plurality of position dependent kernels 118 and wL2(x, y, i, j) denotes the plurality of further position dependent kernels associated to the further array of input data values.
  • the output data values 121 can be normalized using the following normalization factor:

    WL12(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j) · wL2(x, y, i, j).
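A sketch of the two-kernel guided correlation along the lines of the equations above; this is an assumption-laden illustration, not the embodiment itself: the function name is invented, and out-of-range taps are simply skipped, which is only one of several possible border policies.

```python
import numpy as np

def guided_correlation(in1, in2, w1, w2, r=1):
    """Correlate two feature maps under two sets of position dependent
    kernels, normalized by the sum of the products of the kernel values.

    in1, in2 -- 2D arrays of input data values
    w1, w2   -- position dependent kernels, shape (H, W, 2r+1, 2r+1)
    """
    H, W = in1.shape
    out = np.zeros((H, W))
    for x in range(H):
        for y in range(W):
            acc, norm = 0.0, 0.0
            for i in range(-r, r + 1):
                for j in range(-r, r + 1):
                    if 0 <= x - i < H and 0 <= y - j < W:  # skip border taps
                        ww = w1[x, y, i + r, j + r] * w2[x, y, i + r, j + r]
                        acc += ww * in1[x - i, y - j] * in2[x - i, y - j]
                        norm += ww
            out[x, y] = acc / norm if norm else 0.0
    return out
```

Setting all kernel values to one reduces this to a plain local correlation; position dependent values amplify or attenuate individual taps.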
  • the neural network layer 120 is configured to process the array of input data values 117 on the basis of the plurality of position dependent kernels 118 using a maximum or minimum pooling scheme. More specifically, in such an embodiment, the neural network layer 120 is configured to generate a respective output data value of the array of output data values 121 by determining the respective input data value of a respective sub-array of the plurality of sub-arrays of the array of input data values 117 which is associated with the maximum or minimum kernel value of the respective position dependent kernel of the plurality of position dependent kernels 118 and using the respective determined input data value as the respective output data value.
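The maximum variant of this pooling scheme can be sketched as follows; the function name, shapes and restriction to interior positions (to keep border handling out of the sketch) are assumptions for illustration.

```python
import numpy as np

def guided_max_pool(inp, wL, r=1):
    """For each output position, pick the input value at the position of
    the maximum kernel value of its position dependent kernel."""
    H, W = inp.shape
    out = np.zeros((H - 2 * r, W - 2 * r))
    for x in range(r, H - r):
        for y in range(r, W - r):
            k = wL[x, y]                                   # kernel at (x, y)
            i, j = np.unravel_index(np.argmax(k), k.shape)
            # kernel index (i, j) addresses input position
            # (x - (i - r), y - (j - r)), matching in(x - i, y - j)
            out[x - r, y - r] = inp[x - (i - r), y - (j - r)]
    return out
```

The minimum variant would replace `np.argmax` with `np.argmin`.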
  • the neural network layer 120 according to one of the embodiments described above can be used for stereo matching cost aggregation: the neural network 110 can use e.g. object features derived from semantic segmentation as the guiding data 113 in order to determine the object borders in the scene and guide the aggregation process of the input stereo matching cost, producing the aggregated stereo matching cost as an output.
  • Figure 10 shows a flow diagram illustrating a data processing method 1000 based on a neural network 110 according to an embodiment.
  • the data processing method 1000 can be performed by the data processing apparatus 100 shown in figure 1 and its different embodiments.
  • the data processing method 1000 comprises the step 1001 of generating, by the neural network layer 120 of the neural network 110, the array of output data values 121 from the array of input data values 117 based on the plurality of position dependent kernels 118 and the plurality of sub-arrays of the array of input data values 117.
  • Embodiments of the data processing methods may be implemented and/or performed by one or more processors as described above.
  • a first kernel is considered different to a second kernel if a kernel value of the array of kernel values of the first kernel at at least one position (or of at least one element) of the first kernel is different from the kernel value of the array of kernel values of the second kernel at the same position (or of the same element) of the kernel.
  • the different sub-arrays of the array of input values all have the same size and dimension. Accordingly, the different kernels typically have the same size and dimension.
  • a first sub-array of the array of input values is considered different to a second sub-array of the array of input values if the first sub-array of the array of input values comprises at least one element of the array of input values which is not comprised by the second sub-array of the array of input values.
  • the different sub-arrays of the array of input values differ at least by one column or row of elements of the array of input values.
  • the different sub-arrays may partially overlap or not overlap, as shown in Fig. 3.
  • Embodiments of the proposed guided aggregation can be applied for guided feature maps down-scaling.
  • the position dependent kernels derived from the guiding data serve as the additional input.
  • input values which are features of the feature map are grouped to form input data sub-arrays of the input data array and can be further aggregated in a controlled way, producing an output feature value representative of the whole sub-array.
  • the guiding data represents information about object or region borders, obtained by e.g. color-based segmentation, semantic segmentation using preceding neural network layers or an edge map of a texture image corresponding to the processed feature map.
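Such guided down-scaling could be sketched as follows for non-overlapping sub-arrays; the scale factor s, the function name and the per-position weight layout are assumptions for illustration, not taken from the embodiment.

```python
import numpy as np

def guided_downscale(feat, w, s=2):
    """Down-scale a feature map by factor s: each non-overlapping s x s
    sub-array is aggregated into one output value under its own weights.

    feat -- 2D feature map, shape (H, W), with H and W divisible by s
    w    -- aggregation weights, same shape as feat (hypothetical guidance)
    """
    H, W = feat.shape
    f = feat.reshape(H // s, s, W // s, s)
    g = w.reshape(H // s, s, W // s, s)
    # weighted average over each s x s block, normalized per block
    return (f * g).sum(axis=(1, 3)) / g.sum(axis=(1, 3))
```

Weights near zero on one side of an object border would keep those features from leaking into the down-scaled value.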
  • Embodiments of the proposed guided convolution can be applied for switchable feature extraction.
  • input values which are features of the feature map are convolved with adaptable feature extraction filters which are formed from the input guiding data in the form of position dependent kernels.
  • each selected area of the input feature map can be processed with feature extraction filters producing only the features desired for these regions.
  • the guiding data in the form of similarity features represents information about object or region borders, obtained by e.g. color-based segmentation, semantic segmentation using preceding neural network layers, an edge map of a texture image corresponding to the processed feature map or a ROI (region of interest) binary map.
  • Embodiments of the proposed guided correlation can be applied for guided correlation of input feature maps.
  • the position dependent kernels derived from the guiding data serve as the additional input.
  • input values which are features of the two or more feature maps are correlated together in a controlled way, enabling amplification or attenuation of some features within a correlation region. This way, features that correspond to other objects/regions in the feature map can be excluded or taken with smaller impact when computing the result. Also, some of the features characteristic for a selected region can be amplified.
  • the guiding data represents information about object or region borders, obtained by e.g. color-based segmentation, semantic segmentation using preceding neural network layers, an edge map of a texture image corresponding to the processed feature map or a ROI (region of interest) binary map.
  • normalization is advantageous if the output values obtained for different spatial positions are to be compared to each other per value, without any intermediate step; in that case, preservation of the mean (DC) component is desirable. If such a comparison is not performed, normalization is not required and only increases complexity. Additionally, normalization can be omitted in order to simplify the computations and compute only an approximate result.


Abstract

The invention relates to a data processing apparatus (100) comprising a processor (101) configured to provide a neural network (110), wherein the neural network (110) comprises a neural network layer (120) being configured to generate from an array of input data values (117) an array of output data values (121) based on a plurality of position dependent kernels (118) and a plurality of sub-arrays of the array of input data values (117). Moreover, the invention relates to a corresponding data processing method.

Description

DESCRIPTION
Neural network data processing apparatus and method

TECHNICAL FIELD
Generally, the present invention relates to the field of machine learning or deep learning based on neural networks. More specifically, the present invention relates to a neural network data processing apparatus and method, in particular for processing data in the fields of audio processing, computer vision, image or video processing, classification, detection and/or recognition.
BACKGROUND
Weighted aggregation, which is commonly used in many signal processing applications, such as image processing methods for image quality improvement, depth or disparity estimation and many other applications [Kaiming He, Jian Sun, Xiaoou Tang, "Guided Image Filtering", ECCV 2010], is a process in which input data is combined to pack the information present in a larger spatial area into one single spatial position, with additional input in the form of aggregation weights that control the influence of each input data value on the result.
In deep learning, a common approach recently used in many application fields is the utilization of convolutional neural networks. Generally, a specific part of such convolutional neural networks is at least one convolution (or convolutional) layer which performs a convolution of input data values with a learned kernel K, producing one output data value per convolution kernel for each output position [J. Long, E. Shelhamer, T. Darrell, "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015]. For the two-dimensional case used, for instance, in image processing, the convolution using the learned kernel K can be expressed mathematically as follows:

out(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} K(i, j) · in(x − i, y − j) + B,

wherein out(x, y) denotes the array of output data values, in(x − i, y − j) denotes a sub-array of an array of input data values and K(i, j) denotes the kernel comprising an array of kernel weights or kernel values of size (2r+1)×(2r+1). B denotes an optional learned bias term, which can be added for obtaining each output data value. The weights of the kernel K are the same for the whole array of input data values in(x, y) and are generally learned during a learning phase of the neural network which, in the case of 1st-order methods, consists of iteratively back-propagating the gradients of the neural network output back to the input layers and updating the weights of all the network layers by the partial derivatives computed in this way.
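As a point of reference, the position independent convolution of the equation above can be sketched as below; this is a minimal valid-region implementation with an invented function name, not production convolution code.

```python
import numpy as np

def conv2d_single_kernel(inp, K, B=0.0):
    """Plain convolution with one learned, position independent kernel K
    of size (2r+1) x (2r+1); valid region only (no padding)."""
    r = K.shape[0] // 2
    H, W = inp.shape
    out = np.zeros((H - 2 * r, W - 2 * r))
    for x in range(r, H - r):
        for y in range(r, W - r):
            # out(x, y) = sum_ij K(i, j) * in(x - i, y - j) + B
            out[x - r, y - r] = B + sum(
                K[i + r, j + r] * inp[x - i, y - j]
                for i in range(-r, r + 1) for j in range(-r, r + 1))
    return out
```

The same kernel K is applied at every position; the invention described below replaces K with a kernel that varies per position.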
SUMMARY
It is an object of the invention to provide an improved data processing apparatus and method based on neural networks.
The foregoing and other objects are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
Generally, embodiments of the invention provide a new approach for weighted aggregation of data for neural networks that is implemented into a neural network as a new type of neural network layer. The neural network layer can compute aggregated data using individual aggregation weights that are learned for each individual spatial position.
Aggregation weights can be computed as a function of similarity features and learned weight kernels, resulting in individual aggregation weights for each output spatial position.
In this way a variety of sophisticated position dependent or position adaptive kernels learned by the neural network can be utilized for better adaptation of the aggregation weights to input data.
More specifically, according to a first aspect the invention relates to a data processing apparatus comprising one or more processors configured to provide a neural network. The data to be processed by the data processing apparatus can be, for instance, two-dimensional image or video data or one-dimensional audio data.
The neural network provided by the one or more processors of the data processing apparatus comprises a neural network layer being configured to process an array of input data values, such as a two-dimensional array of input data values in(x, y), into an array of output data values, such as a two-dimensional array of output data values out(x, y). The neural network layer can be a first layer or an intermediate layer of the neural network. The array of input data values can be one-dimensional (i.e. a vector, e.g. audio or other e.g. temporal sequence), two-dimensional (i.e. a matrix, e.g. an image or other temporal or spatial sequence), or N-dimensional (e.g. any kind of N-dimensional feature array, e.g. provided by a conventional pre-processing or feature extraction and/or by other layers of the neural network).
The array of input data values can have one or more channels, e.g. for an RGB image one R-channel, one G-channel and one B-channel, or for a black/white image only one grey-scale or intensity channel. The term "channel" can refer to any "feature", e.g. features obtained from conventional pre-processing or feature extraction or from other neural networks or neural network layers of the same neural network. The array of input data values can comprise, for instance, two-dimensional RGB or grey scale image or video data representing at least a part of an image, or a one-dimensional audio signal. In case the neural network layer is implemented as an intermediate layer of the neural network, the array of input data values can be, for instance, any kind of array of features generated by previous layers of the neural network on the basis of an initial, e.g. original, array of input data values, e.g. by means of a feature extraction.
The neural network layer is configured to generate from the array of input data values the array of output data values on the basis of a plurality of position dependent, i.e. spatially variable kernels and a plurality of different sub-arrays at different positions of the array of input data values. Each kernel comprises a plurality of kernel values or kernel weights. A respective kernel is applied to a respective sub-array of the array of input data values to generate a single output data value.
A "position dependent kernel" as used herein means a kernel whose kernel weights depend on the respective position, e.g. (x,y) for two-dimensional arrays, of the sub-array of input data values. In other words, for a first kernel the kernel values applied to a first sub-array of the array of input data values can differ from the kernel values of a second kernel applied to a second sub-array of the array of input data values. In a two- dimensional array the position could be a spatial position defined, for instance, by two spatial coordinates. In a one-dimensional array the position could be a temporal position defined, for instance, by a time coordinate. Thus, an improved data processing apparatus based on neural networks is provided. The data processing apparatus allows to aggregate the input data in a way that can better reflect mutual data similarity, i.e. the resultant output data value is more strongly influenced by input data values that are closer and more similar to input data in the center position of the kernel. Moreover, the data processing apparatus allows adapting the kernel weights for different spatial positions of the array of input data values. This, in turn, allows, for instance, minimizing the influence of some of the input data values on the result, for instance the input data values that are associated with another part of the scene (as determined by semantic segmentation) or a different object that is being analysed.
In a further implementation form of the first aspect, the neural network comprises at least one additional network layer configured to generate the plurality of position dependent kernels on the basis of an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values. The original array of original input values can be the array of input data values or a different array.
In a further implementation form of the first aspect, the neural network is configured to generate the plurality of position dependent kernels based on a plurality of learned position independent kernels and a plurality of position dependent weights. Generally, the position independent kernels can be learned by the neural network and the position dependent weights or similarity features can be computed, for instance, by a further preceding layer of the neural network. This implementation form allows minimizing the amount of data being transferred to the neural network layer in order to obtain the kernel values. This is because the kernel values are not transferred directly, but computed from the plurality of position dependent weights and/or similarity features, substantially reducing the amount of data for each element of the array of output data values. This can minimize the amount of data being stored and transferred by the neural network between the different network layers, which is especially important during the learning process on the basis of the mini-batch approach, as the memory of the data processing apparatus (GPU) is currently the main bottleneck.
In a further implementation form of the first aspect, the neural network is configured to generate a kernel of the plurality of position dependent kernels by adding the learned position independent kernels each weighted by the associated non-learned position dependent weights (i.e. similarity features). This implementation form provides a very efficient representation of the plurality of position dependent kernels using a linear combination of position independent "base kernels".
In a further implementation form of the first aspect, the plurality of position independent kernels are predetermined or learned, and the neural network comprises at least one additional neural network layer or "conventional" pre-processing layer configured to generate the plurality of position dependent weights (i.e. similarity features) based on an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values. The original array of original input values can be the array of input data values or a different array. In an implementation form the at least one additional neural network layer or "conventional" pre-processing layer can generate the plurality of position dependent weights (i.e. similarity features) using, for instance, bilateral filtering, semantic segmentation, per-instance object detection, and data importance indicators like a ROI (region of interest).
In a further implementation form of the first aspect, the array of input data values and the array of output data values are two-dimensional arrays, and the convolutional neural network layer is configured to generate the plurality of position dependent kernels wL(x, y, i, j) on the basis of the following equation:

wL(x, y, i, j) = Σ_{f=1..Nf} Ff(x, y) · Kf(i, j),

wherein Ff(x, y) denotes the set of Nf position dependent weights (i.e. similarity features) and Kf denotes the plurality of position independent kernels.
In a further implementation form of the first aspect, the neural network layer is a convolutional network layer or an aggregation network layer.
In a further implementation form of the first aspect, the array of input data values and the array of output data values are two-dimensional arrays, wherein the array of input data values in(x, y, ci) has Ci different channels and wherein the neural network layer is a convolutional network layer configured to generate the array of output data values out(x, y, co) on the basis of the following equations:

out(x, y, co) = 1/WL(x, y, co) · Σ_{ci=1..Ci} Σ_{i=-r..r} Σ_{j=-r..r} wL(x, y, i, j, ci, co) · in(x − i, y − j, ci),

WL(x, y, co) = Σ_{ci=1..Ci} Σ_{i=-r..r} Σ_{j=-r..r} wL(x, y, i, j, ci, co),

wherein r denotes the size of each kernel of the plurality of position dependent kernels wL(x, y, i, j, ci, co) and WL(x, y, co) denotes a normalization factor. In an implementation form the normalization factor WL(x, y, co) can be set equal to 1.
In a further implementation form of the first aspect, the array of input data values and the array of output data values are two-dimensional arrays, wherein the array of input data values in(x, y) has only a single channel and wherein the neural network layer is an aggregation network layer configured to generate the array of output data values out(x, y) on the basis of the following equations:

out(x, y) = 1/WL(x, y) · Σ_{i=-r..r} Σ_{j=-r..r} wL(x, y, i, j) · in(x − i, y − j),

WL(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} wL(x, y, i, j),

wherein r denotes the size of each kernel of the plurality of position dependent kernels wL(x, y, i, j) and WL(x, y) denotes a normalization factor. In an implementation form the normalization factor WL(x, y) can be set equal to 1.
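A sketch of such a single-channel aggregation layer; the valid-region restriction, the function name and the kernel-flip detail (which accounts for the in(x − i, y − j) indexing) are implementation assumptions for illustration.

```python
import numpy as np

def guided_aggregation(inp, wL, r=1):
    """Single-channel aggregation layer: each output value is a weighted
    average of a (2r+1) x (2r+1) input sub-array under its own kernel."""
    H, W = inp.shape
    out = np.zeros((H - 2 * r, W - 2 * r))
    for x in range(r, H - r):
        for y in range(r, W - r):
            sub = inp[x - r:x + r + 1, y - r:y + r + 1]
            k = wL[x, y, ::-1, ::-1]   # flip so k matches in(x - i, y - j)
            out[x - r, y - r] = (k * sub).sum() / k.sum()  # W_L(x, y) norm
    return out
```

With all-ones kernels this degenerates to a box average; position dependent kernels turn it into a spatially adaptive weighted average.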
In a further implementation form of the first aspect, the neural network layer is a correlation network layer configured to generate the array of output data values from the array of input data values and a further array of input data values by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels, or by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels associated to the array of input data values and a further position dependent kernel of a plurality of further position dependent kernels associated to the further array of input data values. In a further implementation form of the first aspect, the array of input data values in1(x, y), the further array of input data values in2(x, y) and the plurality of position dependent kernels wL1(x, y, i, j) are two-dimensional arrays and wherein the correlation neural network layer is configured to generate the array of output data values out(x, y) on the basis of the following equations:

out(x, y) = 1/WL1(x, y) · Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j) · in1(x − i, y − j) · in2(x − i, y − j),

WL1(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j),

wherein r denotes the size of each kernel of the plurality of position dependent kernels wL1(x, y, i, j) and WL1(x, y) denotes a normalization factor. In an implementation form the normalization factor WL1(x, y) can be set equal to 1.
In a further implementation form of the first aspect, the array of input data values in1(x, y), the further array of input data values in2(x, y), the plurality of position dependent kernels wL1(x, y, i, j) and the plurality of further position dependent kernels wL2(x, y, i, j) are two-dimensional arrays and wherein the correlation neural network layer is configured to generate the array of output data values out(x, y) on the basis of the following equations:

out(x, y) = 1/WL12(x, y) · Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j) · in1(x − i, y − j) · wL2(x, y, i, j) · in2(x − i, y − j),

WL12(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j) · wL2(x, y, i, j),

wherein r denotes the size of each kernel of the plurality of position dependent kernels wL1(x, y, i, j) and of each kernel of the plurality of further position dependent kernels wL2(x, y, i, j) and WL12(x, y) denotes a normalization factor. In an implementation form the normalization factor WL12(x, y) can be set equal to 1.
In a further implementation form of the first aspect, the neural network layer is configured to generate a respective output data value of the array of output data values by determining a respective input data value of a respective sub-array of input data values of the plurality of sub-arrays of input data values being associated with a maximum or minimum kernel value of a position dependent kernel and using the respective determined input data value as the respective output data value.
According to a second aspect the invention relates to a corresponding data processing method comprising the step of generating by a neural network layer of a neural network from an array of input data values an array of output data values based on a plurality of position dependent kernels and a plurality of sub-arrays of the array of input values.
In a further implementation form of the second aspect, the method comprises the further step of generating the position dependent kernel of the plurality of position dependent kernels by an additional neural network layer of the neural network based on an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values.
In a further implementation form of the second aspect, the step of generating the position dependent kernel of the plurality of position dependent kernels comprises generating the position dependent kernel of the plurality of position dependent kernels based on a plurality of position independent kernels and a plurality of position dependent weights.
In a further implementation form of the second aspect, the step of generating a kernel of the plurality of position dependent kernels comprises summing the position independent kernels, each weighted by its associated position dependent weight.
In a further implementation form of the second aspect, the plurality of position independent kernels are predetermined or learned and the step of generating the plurality of position dependent weights comprises generating the plurality of position dependent weights by an additional neural network layer or a processing layer of the neural network based on an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values.

In a further implementation form of the second aspect, the array of input data values and the array of output data values are two-dimensional arrays, and the step of generating a kernel wL(x, y, i, j) of the plurality of position dependent kernels is based on the following equation:

$$w_L(x,y,i,j) = \sum_{f=1}^{N_f} F_f(x,y)\, K_f(i,j)$$
wherein Ff(x, y) denotes the plurality of Nf position dependent weights (i.e. similarity features) and Kf(i, j) denotes the plurality of position independent kernels.
In a further implementation form of the second aspect, the neural network layer is a convolutional network layer or an aggregation network layer.
In a further implementation form of the second aspect, the array of input data values and the array of output data values are two-dimensional arrays and the neural network layer is a convolutional network layer, wherein the step of generating the array of output data values is based on the following equation:

$$out(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,i,j)\, in(x-i,y-j)$$

or, wherein the neural network layer is an aggregation network layer and wherein the step of generating the array of output data values is based on the following equations:

$$out(x,y) = \frac{1}{W_L(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,i,j)\, in(x-i,y-j), \qquad W_L(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,i,j)$$

In implementation forms, the normalization factor WL(x, y) can be set equal to 1.

In a further implementation form of the second aspect, the neural network layer is a correlation network layer and the step of generating the array of output data values comprises generating the array of output data values from the array of input data values and a further array of input data values (a) by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels, or (b) by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels associated to the array of input data values and a further position dependent kernel of a plurality of further position dependent kernels associated to the further array of input data values.
In a further implementation form of the second aspect, the array of input data values, the further array of input data values and the position dependent kernels are two-dimensional arrays and the step of generating the array of output data values by the correlation network layer is based on the following equations:

$$out(x,y) = \frac{1}{W_L(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x,y,i,j)\, in1(x-i,y-j)\, in2(x-i,y-j)$$

or

$$out(x,y) = \frac{1}{W_{L12}(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x,y,i,j)\, in1(x-i,y-j)\, w_{L2}(x,y,i,j)\, in2(x-i,y-j)$$

In any of the above implementation forms, the normalization factors WL or WL12 can be set equal to 1.
In a further implementation form of the second aspect, the step of generating an output data value of the array of output data values by the neural network layer comprises the steps of determining an input data value of a sub-array of input data values of the plurality of sub-arrays of input data values being associated with a maximum or minimum kernel value of a position dependent kernel and using the determined input data value as the output data value.
According to a third aspect the invention relates to a computer program comprising program code for performing the method according to the second aspect, when executed on a processor or a computer.
The invention can be implemented in hardware and/or software or in any combination thereof.
BRIEF DESCRIPTION OF THE DRAWINGS
Further embodiments of the invention will be described with respect to the following figures, wherein:
Fig. 1 shows a schematic diagram illustrating a data processing apparatus based on a neural network according to an embodiment;
Fig. 2 shows a schematic diagram illustrating a neural network provided by a data processing apparatus according to an embodiment;
Fig. 3 shows a schematic diagram illustrating the concept of down-stepping or aggregation of data implemented in a data processing apparatus according to an embodiment;
Fig. 4 shows a schematic diagram illustrating different aspects of a neural network provided by a data processing apparatus according to an embodiment;
Fig. 5 shows a schematic diagram illustrating different aspects of a neural network provided by a data processing apparatus according to an embodiment;
Fig. 6 shows a schematic diagram illustrating different processing steps of a data processing apparatus according to an embodiment;

Fig. 7 shows a schematic diagram illustrating a neural network provided by a data processing apparatus according to an embodiment;
Fig. 8 shows a schematic diagram illustrating different aspects of a neural network provided by a data processing apparatus according to an embodiment;
Fig. 9 shows a schematic diagram illustrating different processing steps of a data processing apparatus according to an embodiment; and

Fig. 10 shows a flow diagram illustrating a neural network data processing method according to an embodiment.
In the various figures, identical reference signs will be used for identical or at least functionally equivalent features.
DETAILED DESCRIPTION OF EMBODIMENTS
In the following description, reference is made to the accompanying drawings, which form part of the disclosure, and in which are shown, by way of illustration, specific aspects in which the present invention may be placed. It is understood that other aspects may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description, therefore, is not to be taken in a limiting sense, as the scope of the present invention is defined by the appended claims. For instance, it is understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if a specific method step is described, a corresponding device may include a unit to perform the described method step, even if such unit is not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.
Figure 1 shows a schematic diagram illustrating a data processing apparatus 100 according to an embodiment configured to process data on the basis of a neural network. To this end, the data processing apparatus 100 shown in figure 1 comprises a processor 101. In an embodiment, the data processing apparatus 100 can be implemented as a distributed data processing apparatus 100 comprising more than the one processor 101 shown in figure 1. The processor 101 of the data processing apparatus 100 is configured to provide a neural network 110. As will be described in more detail further below, the neural network 110 comprises a neural network layer being configured to generate from an array of input data values an array of output data values based on a plurality of sub-arrays of the array of input data values and a plurality of position dependent kernels comprising a plurality of kernel values or kernel weights. As shown in figure 1, the data processing apparatus 100 can further comprise a memory 103 for storing and/or retrieving the input data values, the output data values and/or the kernel values.
Each position dependent kernel comprises a plurality of kernel values or kernel weights. For a respective position or element of the array of input data values a respective kernel is applied to a respective sub-array of the array of input data values to generate a single output data value. A "position dependent kernel" as used herein means a kernel whose kernel weights depend on the respective position of the sub-array of the array of input data values to which the kernel is applied. In other words, for a first kernel applied to a first sub-array of the plurality of input data values the kernel values can differ from the kernel values of a second kernel applied to a second sub-array of the plurality of input data values forming a different sub-array of the same array of input values.
In a two-dimensional array the position could be a spatial position defined, for instance, by two spatial coordinates (x,y). In a one-dimensional array the position could be a temporal position defined, for instance, by a time coordinate (t).
The array of input data values can be one-dimensional (i.e. a vector, e.g. audio or another temporal sequence), two-dimensional (i.e. a matrix, e.g. an image or another temporal or spatial sequence), or N-dimensional (e.g. any kind of N-dimensional feature array, e.g. provided by a conventional pre-processing or feature extraction and/or by other layers of the neural network 110). The array of input data values can have one or more channels, e.g. for an RGB image one R-channel, one G-channel and one B-channel, or for a black/white image only one grey-scale or intensity channel. The term "channel" can refer to any "feature", e.g. features obtained from conventional pre-processing or feature extraction or from other neural networks or neural network layers of the neural network 110. The array of input data values can comprise, for instance, two-dimensional RGB or grey scale image or video data representing at least a part of an image, or a one-dimensional audio signal. In case the neural network layer 120 is implemented as an intermediate layer of the neural network 110, the array of input data values can be, for instance, an array of similarity features generated by previous layers of the neural network on the basis of an initial, e.g. original, array of input data values, e.g. by means of a feature extraction, as will be described in more detail further below.

As will be described in more detail below, the neural network layer 120 can be implemented as an aggregation layer 120 configured to process each channel of the array of input data values separately, e.g. for a sub-array of the input array of R-values one (scalar) R-output value is generated. The position dependent kernels may be channel-specific or common for all channels. Moreover, the neural network layer 120 can be implemented as a convolution (or convolutional) layer configured to "mix" all channels of the array of input data values. For instance, in case the array of input data values is an RGB image, i.e. a multi-channel array, based on the three corresponding sub-arrays of the three input arrays (R, G and B) only one (scalar) output value is generated for the three channels (R, G and B) of the multi-channel array of input data values. The position dependent kernels may be channel-specific or common for all channels. In the case of a convolution layer 120 the position dependent kernels are generally multi-channel kernels. Furthermore, the neural network layer can be implemented as a correlation layer providing a combination of aggregation or convolution (input image and weighted kernel) and an additional image, i.e. a correlation of two sub-arrays in the two images with each other (e.g. of the same image or at the same position) and an additional application of the position dependent kernel on the correlation result. Also in this case the position dependent kernels may be channel-specific or common for all channels.
Figure 2 shows a schematic diagram illustrating elements of the neural network 110 provided by the data processing apparatus 100 according to an embodiment. In the embodiment shown in figure 2, the neural network layer 120 is implemented as a weighted aggregation layer 120. In a further embodiment, the neural network layer 120 can be implemented as a convolution network layer 120 (also referred to as convolutional network layer 120) or as a correlation network layer 120, as will be described in more detail further below. As indicated in figure 2, in this embodiment the aggregation layer 120 is configured to generate a two-dimensional array of output data values out(x, y) 121 on the basis of a respective sub-array of the two-dimensional array of input data values in(x, y) 117 and a plurality of position dependent kernels 118 comprising a plurality of kernel values or kernel weights.
In an embodiment, the weighted aggregation layer 120 of the neural network 110 shown in figure 2 is configured to generate the array of output data values out(x, y) 121 on the basis of the plurality of sub-arrays of the two-dimensional array of input data values in(x, y) 117 and the plurality of position dependent kernels 118 comprising the kernel values wL(x, y, i, j) using the following equation:

$$out(x,y) = \frac{1}{W_L(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,i,j)\, in(x-i,y-j)$$

wherein r denotes a size of each kernel of the plurality of position dependent kernels 118 (in this example, each kernel and each sub-array of the array of input values has (2r+1)×(2r+1) kernel values respectively input values) and the output data values can be normalized using the following normalization factor:

$$W_L(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,i,j)$$
In other embodiments, the normalization factor can be omitted, i.e. set to one. For instance, in case the neural network layer 120 is implemented as a convolutional network layer the normalization factor can be omitted. For weighted aggregation the normalization factor allows to keep the mean value or DC component. This can be advantageous when the weighted aggregation layer 120 is used to aggregate stereo matching costs of a stereo image, because the normalization is beneficial for making the output values for different sub-arrays of the array of input data values comparable. This is usually not necessary in the case of the convolutional network layer 120. As will be appreciated, the above equations for a two-dimensional input array and a kernel having a quadratic shape can be easily adapted to the case of an array of input values 117 having one dimension or more than two dimensions and/or a kernel having a rectangular shape, e.g. a non-square rectangular shape with different horizontal and vertical dimensions.

For an embodiment where the neural network layer 120 is implemented as a convolution network layer and the array of input data values in(x, y, c_i) 117 is a two-dimensional array of input data values having more than one channel, such as in the case of RGB image data, the neural network layer 120 can be configured to generate the array of output data values out(x, y, c_o) 121 having one or more channels on the basis of the plurality of sub-arrays of the two-dimensional array of input data values in(x, y, c_i) 117 in the different channels and the plurality of position dependent kernels 118 comprising the kernel values wL(x, y, c_o, c_i, i, j) using the following equation:

$$out(x,y,c_o) = \sum_{c_i=1}^{C_i} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,c_o,c_i,i,j)\, in(x-i,y-j,c_i)$$

wherein C_i denotes the number of channels of the array of input data values in(x, y, c_i) 117 and the output data values can be normalized using the following normalization factor:

$$W_L(x,y,c_o) = \sum_{c_i=1}^{C_i} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,c_o,c_i,i,j)$$
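By way of illustration, the position dependent weighted aggregation described above can be sketched in a few lines of NumPy. The function name, array layout and edge padding are assumptions made for this sketch and not part of the described apparatus; the multi-channel convolution variant would additionally sum over input channels.

```python
import numpy as np

def aggregate(inp, kernels, normalize=True):
    """Weighted aggregation with one kernel per spatial position (sketch).

    inp     : (H, W) array of input data values in(x, y)
    kernels : (H, W, 2r+1, 2r+1) array holding w_L(x, y, i, j) per position
    """
    H, W, kh, kw = kernels.shape
    r = kh // 2
    padded = np.pad(inp, r, mode="edge")          # handle borders by replication
    out = np.empty((H, W))
    for y in range(H):
        for x in range(W):
            patch = padded[y:y + kh, x:x + kw]    # sub-array centred on (x, y)
            w = kernels[y, x]                     # the position dependent kernel
            s = w.sum() if normalize else 1.0     # normalization factor W_L(x, y)
            out[y, x] = (w * patch).sum() / s
    return out
```

With uniform kernel values and normalization enabled, each output value reduces to the mean of its sub-array, which is a convenient sanity check.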
In an embodiment, the neural network layer 120 is configured to generate the array of output data values 121 with a smaller size than the array of input data values 117. In other words, in an embodiment, the neural network 110 is configured to perform a down-step operation on the basis of the plurality of position dependent kernels 118. Figure 3 illustrates a down-step operation provided by the neural network 110 of the data processing apparatus 100 according to an embodiment. Using a down-step operation allows increasing the receptive field, enables processing the data with a cascade of smaller filters as compared with a single layer with a kernel covering an equal receptive field, and also enables the neural network 110 to better analyze the data by finding more sophisticated relationships among the data and adding more non-linearities to the processing chain by separating each convolution layer with a non-linear element like a sigmoid or a Rectified Linear Unit (ReLU).

In the down-step operation illustrated in figure 3 the neural network layer 120 can combine the input data values to produce the array of output data values with a reduced resolution. This can be achieved by convolving the array of input data values 117 with the position dependent kernels 118 with a stride S greater than 1. The stride S specifies the spacing between neighboring input spatial positions for which convolutions are computed. If the stride S is equal to 1, the convolution is performed for each spatial position. If the stride S is greater than 1, the neural network layer 120 is configured to perform a convolution for every S-th spatial position of the array of input data values 117, thereby reducing the output resolution (i.e. the dimensions of the array of output data values 121) by a factor of S for each spatial dimension. The horizontal and the vertical stride may be the same or different. In the exemplary embodiment shown in figure 3, the neural network layer 120 combines the array of input data values 117 from the spatial area of size (2r+1)×(2r+1) to produce a respective output data value of the array of output data values 121. In this way, the input data values 117 can be aggregated to pack information present in a larger spatial area into one single spatial position.
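The strided down-step can be sketched as a small variant of the aggregation: the kernels are evaluated only at every S-th input position, so the output resolution shrinks by the stride factor. The array layout and names below are illustrative assumptions, not the described apparatus.

```python
import numpy as np

def aggregate_strided(inp, kernels, stride=2):
    """Down-step: apply one position dependent kernel per *output* position,
    sampling the input at every S-th spatial position (sketch).

    inp     : (H, W) input array
    kernels : (Ho, Wo, 2r+1, 2r+1) kernels, one per output position
    """
    Ho, Wo, kh, kw = kernels.shape
    r = kh // 2
    padded = np.pad(inp, r, mode="edge")
    out = np.empty((Ho, Wo))
    for yo in range(Ho):
        for xo in range(Wo):
            y, x = yo * stride, xo * stride       # stride S input spacing
            patch = padded[y:y + kh, x:x + kw]
            w = kernels[yo, xo]
            out[yo, xo] = (w * patch).sum() / w.sum()
    return out
```

For a 4×4 input and stride 2 this produces a 2×2 output, i.e. the resolution is reduced by a factor of S in each spatial dimension.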
In the embodiment shown in figure 2, the neural network 110 comprises one or more preceding layers 115 preceding the neural network layer 120 and one or more following layers 125 following the neural network layer 120. In an embodiment, the neural network layer 120 could be the first and/or the last data processing layer of the neural network 110, i.e. in an embodiment there could be no preceding layers 115 and/or no following layers 125.
In an embodiment, the one or more preceding layers 115 can be further neural network layers and/or "conventional" pre-processing layers, such as a feature extraction layer. Likewise, in an embodiment, the one or more following layers 125 can be further neural network layers, such as a deconvolutional layer, and/or "conventional" post-processing layers.
As shown in the embodiment of figure 2, one or more of the preceding layers 115 can be configured to provide, i.e. to generate, the plurality of position dependent kernels 118 (see the bottom signal path of the preceding layers 115 from the guiding data 113 to the position dependent kernels 118 in Fig. 2). In an embodiment, the one or more layers of the preceding layers 115 can generate the plurality of position dependent kernels 118 on the basis of an original array 111 of original input data values, e.g. an original image in the 2D example. As indicated in figure 2, in an embodiment, the original array 111 of original input data values can be an array of input data 111 being the original input of the neural network 110. In another embodiment, the one or more preceding layers 115 could be configured to generate just the plurality of position dependent kernels 118 on the basis of the original input data 111 of the neural network 110 and to provide the original input data 111 of the neural network 110 as the array of input data values 117 to the neural network layer 120 (no preceding layers in the top signal path from the original input array 111 to the input array 117, see Fig. 2). In other words, according to an embodiment, the original array 111 may form the input array 117.
As indicated in figure 2, in a further embodiment the one or more preceding layers 115 of the neural network 110 are configured to generate the plurality of position dependent kernels 118 on the basis of an array of guiding data 113. A more detailed view of the processing steps of the neural network 110 of the data processing apparatus 100 according to such an embodiment is shown in figure 4 for the exemplary case of two-dimensional input and output arrays. The array of guiding data 113 is used by the one or more preceding layers 115 of the neural network 110 to generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113. As already described in the context of figure 2, the neural network layer 120 is configured to generate the two-dimensional array of output data values out(x, y) 121 on the basis of the two-dimensional array of input data values in(x, y) 117 and the plurality of position dependent kernels wL(x, y) 118, which, in turn, are based on the array of guiding data g(x, y) 113. In an embodiment, the one or more preceding layers 115 of the neural network 110 are neural network layers configured to learn the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113. In another embodiment, the one or more preceding layers 115 of the neural network 110 are pre-processing layers configured to generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data 113 using one or more pre-processing schemes, such as feature extraction.
In an embodiment, the one or more preceding layers 115 of the neural network 110 are configured to generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113 in a way analogous to bilateral filtering, as illustrated in figure 5. Bilateral filtering is known in the field of image processing for performing a weighted aggregation of input data, while decreasing the influence of some input values and amplifying the influence of other input values on the aggregation result [M. Elad, "On the origin of bilateral filter and ways to improve it", IEEE Transactions on Image Processing, vol. 11, no. 10, pp. 1141-1151, October 2002]. As illustrated in figure 5, the weights 518 utilized for aggregating the array of input data values 517 adapt to the input data 517 using the guiding image data g 513, which provides additional information to control the aggregation process. In an embodiment, the array of guiding image data 513 can be equal to the array of input data values for generating the array of output data values 521 by the layer 520 on the basis of the weights 518. The bilateral filter weights 518 take into consideration the distance of the value within the kernel from the center of the kernel and, additionally, the similarity of the data values with the data in the center of the kernel, as mathematically described by the following equation:

$$out(x,y) = \frac{1}{W(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w(x,y,i,j)\, in(x-i,y-j)$$

wherein the normalization factor is based on the following equation:

$$W(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w(x,y,i,j)$$

In an embodiment, the bilateral filter weights 518 are defined by the following equation:

$$w(x,y,i,j) = \exp\!\left(-\frac{i^2+j^2}{2\sigma_d^2}\right) \exp\!\left(-\frac{d\big(g(x,y),\, g(x-i,y-j)\big)^2}{2\sigma_r^2}\right)$$

wherein d(·, ·) denotes a distance function.
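The bilateral weighting above can be sketched as follows; the squared-difference distance, the function name and the padding choice are assumptions made for this illustration, standing in for the generic distance function d(·, ·).

```python
import numpy as np

def bilateral_weights(guide, r, sigma_d, sigma_r):
    """Build position dependent kernels in the style of bilateral filter
    weights (sketch): a spatial distance term times a range similarity term.

    guide : (H, W) guiding image g(x, y)
    Returns an (H, W, 2r+1, 2r+1) array of weights w(x, y, i, j).
    """
    H, W = guide.shape
    ii, jj = np.mgrid[-r:r + 1, -r:r + 1]
    spatial = np.exp(-(ii**2 + jj**2) / (2 * sigma_d**2))   # distance from centre
    padded = np.pad(guide, r, mode="edge")
    w = np.empty((H, W, 2 * r + 1, 2 * r + 1))
    for y in range(H):
        for x in range(W):
            patch = padded[y:y + 2 * r + 1, x:x + 2 * r + 1]
            diff = patch - guide[y, x]                      # similarity to centre value
            w[y, x] = spatial * np.exp(-diff**2 / (2 * sigma_r**2))
    return w
```

On a flat guiding image the range term is 1 everywhere, so each kernel collapses to the pure spatial Gaussian, with a centre weight of exactly 1.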
Figure 6 shows a schematic diagram highlighting the main processing stage 601 of the data processing apparatus 100 according to an embodiment, for instance, the data processing apparatus 100 providing the neural network 110 shown in figure 2. As already described above, in a first processing step 603 the neural network 110 can generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113. In a second processing step 605 the neural network 110 can generate the array of output data values out(x, y) 121 on the basis of the array of input data values in(x, y) 117 and the plurality of position dependent kernels wL(x, y, i, j) 118.

Figure 7 shows a schematic diagram illustrating the neural network 110 provided by the data processing apparatus 100 according to a further embodiment. As will be described in more detail in the following, the main difference to the embodiment shown in figure 2 is that in the embodiment shown in figure 7 the neural network 110 is configured to generate the plurality of position dependent kernels based on a plurality of position independent kernels 119b (shown in figure 8) and a plurality of position dependent weights Ff(x, y) 119a (also referred to as similarity features 119a). In an embodiment, the similarity features 119a are obtained based on the guiding data 113 and could indicate higher-level knowledge about the input data 111, including e.g. semantic segmentation, per-instance object detection, data importance indicators like ROI (Region of Interest) and many others, all learned by the neural network 110 itself or being an additional input to the neural network 110. In an embodiment, the neural network 110 of figure 7 is configured to generate the plurality of position dependent kernels 118 by adding the position independent kernels 119b, each weighted by the associated position dependent weight Ff(x, y) 119a.
In an embodiment, the plurality of position independent kernels 119b can be predetermined or learned by the neural network 110. As illustrated in figure 7, also in this embodiment the neural network 110 can comprise one or more preceding layers 115, which precede the neural network layer 120 and which can be implemented as an additional neural network layer or a pre-processing layer. In an embodiment, one or more layers of the preceding layers 115 are configured to generate the plurality of position dependent weights Ff(x, y) 119a on the basis of an original array of original input data values or the guiding data 113. The original array of original input data values of the neural network 110 can comprise the array of input data values 117 to be processed by the neural network layer 120 or another array of input data values 111 associated to the array of input data values 117, for instance, the initial or original array of input data 111. In the exemplary embodiment shown in figure 7, the array of input data values in(x, y) 117 and the array of output data values out(x, y) 121 are two-dimensional arrays and the neural network layer 120 is configured to generate a respective kernel of the plurality of position dependent kernels wL(x, y, i, j) 118 on the basis of the following equation:

$$w_L(x,y,i,j) = \sum_{f=1}^{N_f} F_f(x,y)\, K_f(i,j)$$

wherein Ff(x, y) denotes the set of Nf position dependent weights (or similarity features) 119a and Kf denotes the plurality of position independent kernels 119b, as also illustrated in figure 8.
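The sum over weighted position independent kernels is a single tensor contraction, which the following NumPy sketch makes explicit; the array layout is an assumption chosen for the example.

```python
import numpy as np

def position_dependent_kernels(F, K):
    """Build w_L(x, y, i, j) = sum_f F_f(x, y) * K_f(i, j) (sketch).

    F : (Nf, H, W)     position dependent weights (similarity features)
    K : (Nf, kh, kw)   position independent kernels (predetermined or learned)
    Returns an (H, W, kh, kw) array: one kernel per spatial position.
    """
    # Contract over the feature index f, keeping spatial and tap indices.
    return np.einsum("fhw,fij->hwij", F, K)
```

Note that the per-position cost is Nf multiply-adds per kernel tap, independent of how many distinct kernels the layer effectively realizes.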
Figure 9 shows a schematic diagram highlighting the main processing stage 901 of the data processing apparatus 100 according to an embodiment, for instance, the data processing apparatus 100 providing the neural network 110 illustrated in figures 7 and 8. As already described above, in a first processing step 903 the neural network 110 can generate the plurality of position dependent weights or similarity features Ff(x, y) 119a on the basis of the array of guiding data g(x, y) 113. In a second processing step 905 the neural network 110 can generate the plurality of position dependent kernels wL(x, y, i, j) 118 on the basis of the plurality of position dependent weights or similarity features Ff(x, y) 119a and the plurality of position independent kernels Kf 119b. In a further step (not shown in figure 9, but similar to the processing step 605 shown in figure 6) the neural network layer 120 can generate the array of output data values out(x, y) 121 on the basis of the array of input data values in(x, y) 117 and the plurality of position dependent kernels wL(x, y, i, j) 118.
As already mentioned above, in an embodiment the neural network layer 120 of the neural network 110 can be implemented in the form of a correlation network layer 120 configured to generate the array of output data values 121 from the array of input data values 117 and a further array of input data values by correlating the array of input data values 117 with the further array of input data values and by applying a respective position dependent kernel of the plurality of position dependent kernels 118 to a respective sub-array of the array of input data values 117 and a corresponding sub-array of the further array of input data values. In case the array of input data values 117, the further array of input data values and the plurality of position dependent kernels 118 are respective two-dimensional arrays (as in the embodiments shown in figures 2 and 7), the correlation network layer 120 can be configured to generate the array of output data values out(x, y) 121 on the basis of the following equation:

$$out(x,y) = \frac{1}{W_{L1}(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x,y,i,j)\, in1(x-i,y-j)\, in2(x-i,y-j)$$

wherein in1(x − i, y − j) denotes the array of input data values 117, in2(x − i, y − j) denotes the further array of input data values, wL1(x, y, i, j) denotes the plurality of position dependent kernels 118 and r denotes the size of each kernel of the plurality of position dependent kernels 118 (in this example, each kernel has (2r+1)×(2r+1) kernel values). The output data values 121 can be normalized using the following normalization factor:

$$W_{L1}(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x,y,i,j)$$

In other embodiments, the normalization factor can be omitted, i.e. set to one. As will be appreciated, the above equations for a two-dimensional input array and a kernel having a quadratic shape can be easily adapted to the case of an array of input values 117 having one dimension or more than two dimensions and/or a kernel having a non-square rectangular shape, i.e. different horizontal and vertical dimensions.
In a further embodiment, the correlation network layer 120 is configured to generate the array of output data values 121 from the array of input data values 117 and the further array of input data values by correlating the array of input data values 117 with the further array of input data values and by applying a respective position dependent kernel of the plurality of position dependent kernels 118 associated to the array of input data values 117 and a further position dependent kernel of a plurality of further position dependent kernels associated to the further array of input data values to a respective sub-array of the array of input data values 117 and a corresponding sub-array of the further array of input data values. In case the array of input data values 117 and the further array of input data values are respective two-dimensional arrays (as in the embodiments shown in figures 2 and 7), the correlation network layer 120 can be configured to generate the array of output data values out(x, y) 121 on the basis of the following equation:

$$out(x,y) = \frac{1}{W_{L12}(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x,y,i,j)\, in1(x-i,y-j)\, w_{L2}(x,y,i,j)\, in2(x-i,y-j)$$

wherein in1(x − i, y − j) denotes the array of input data values 117, in2(x − i, y − j) denotes the further array of input data values, wL1(x, y, i, j) denotes the plurality of position dependent kernels 118 and wL2(x, y, i, j) denotes the plurality of further position dependent kernels associated to the further array of input data values. The output data values 121 can be normalized using the following normalization factor:

$$W_{L12}(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x,y,i,j)\, w_{L2}(x,y,i,j)$$
In a further embodiment, the neural network layer 120 is configured to process the array of input data values 117 on the basis of the plurality of position dependent kernels 118 using a maximum or minimum pooling scheme. More specifically, in such an embodiment, the neural network layer 120 is configured to generate a respective output data value of the array of output data values 121 by determining a respective input data value of a respective sub-array of the plurality of sub-arrays of the array of input data values 117 being associated with a maximum or minimum kernel value of a respective position dependent kernel of the plurality of position dependent kernels 118 and using the respective determined input data value as the respective output data value.
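The maximum-pooling variant can be sketched as follows. This is a minimal NumPy illustration; the zero padding and the indexing convention (the kernel entry at (i, j) addresses the input position (x - i, y - j), matching the equations above) are assumptions, and the function name is illustrative. The minimum variant follows by replacing argmax with argmin.

```python
import numpy as np

def guided_max_pool(inp, w, r):
    """Guided 'max' pooling: the output at (x, y) is the input value located
    where the position dependent kernel w[x, y] takes its maximum value."""
    H, W = inp.shape
    p = np.pad(inp, r)  # zero padding, an assumed border treatment
    out = np.empty((H, W))
    for x in range(H):
        for y in range(W):
            k = w[x, y]                                    # (2r+1, 2r+1) kernel
            i, j = np.unravel_index(np.argmax(k), k.shape)
            # Kernel index (i, j) addresses input position (x - (i - r), y - (j - r)),
            # i.e. padded index (x + 2r - i, y + 2r - j).
            out[x, y] = p[x + 2 * r - i, y + 2 * r - j]
    return out
```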
In a further embodiment, the neural network layer 120 according to one of the embodiments described above is used by the neural network 110 to perform weighted aggregation of stereo matching costs in order to obtain a depth map from a stereo image. Cost aggregation is a commonly used approach to minimize noise and improve the depth estimation results. Without additional weighting, object borders at depth discontinuities would normally be over-smoothed. Consequently, a much desired feature is to preserve these borders by taking into account some additional knowledge about object borders in the scene. Thus, advantageously, the neural network layer 120 can use e.g. object features derived from semantic segmentation as the guiding data 113 in order to determine the object borders in the scene and guide the aggregation process of the input stereo matching costs, producing the aggregated stereo matching costs as output.
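As an illustration of this use case, the following NumPy sketch aggregates a stereo matching cost volume with one position dependent guidance kernel per spatial position, applied identically to every disparity slice. The (D, H, W) volume layout, the zero padding and the function name are assumptions of the sketch, not specifics of the claimed apparatus.

```python
import numpy as np

def aggregate_cost_volume(cost, w, r):
    """Weighted aggregation of a (D, H, W) stereo matching cost volume with one
    (2r+1, 2r+1) guidance kernel per spatial position, shared over disparities.
    Kernels derived from e.g. semantic segmentation put low weight across
    object borders, so depth discontinuities are not smoothed away."""
    D, H, W = cost.shape
    pad = np.pad(cost, ((0, 0), (r, r), (r, r)))  # zero-pad spatial dims only
    out = np.zeros_like(cost, dtype=float)
    for x in range(H):
        for y in range(W):
            k = w[x, y]
            norm = float(k.sum()) or 1.0
            # Window covering positions (x - i, y - j) for all disparity slices.
            win = pad[:, x:x + 2 * r + 1, y:y + 2 * r + 1][:, ::-1, ::-1]
            out[:, x, y] = (win * k).sum(axis=(1, 2)) / norm
    return out
```

A winner-takes-all depth estimate then follows from np.argmin over the disparity axis of the aggregated volume.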
Figure 10 shows a flow diagram illustrating a data processing method 1000 based on a neural network 110 according to an embodiment. The data processing method 1000 can be performed by the data processing apparatus 100 shown in figure 1 and its different embodiments. The data processing method 1000 comprises the step 1001 of generating, by the neural network layer 120 of the neural network 110, from the array of input data values 117 the array of output data values 121 based on the plurality of position dependent kernels 118 and the plurality of sub-arrays of the array of input data values 117. Embodiments of the data processing methods may be implemented and/or performed by one or more processors as described above.
Referring back to the various embodiments described above, a first kernel is considered different to a second kernel if a kernel value of the array of kernel values of the first kernel at at least one position (or of at least one element) of the first kernel is different from the kernel value of the array of kernel values of the second kernel at the same position (or of the same element) of the kernel. Typically, a kernel has the same size (number of elements, positions or values per dimension) and dimension (number of dimensions N, N >= 1) as the sub-array of the array of input values it is applied to. Typically, the different sub-arrays of the array of input values all have the same size and dimension. Accordingly, the different kernels typically have the same size and dimension.
A first sub-array of the array of input values is considered different to a second sub-array of the array of input values if the first sub-array comprises at least one element of the array of input values which is not comprised by the second sub-array. Typically, the different sub-arrays of the array of input values differ by at least one column or row of elements of the array of input values. The different sub-arrays may partially overlap or not overlap, as shown in Fig. 3.
In the following, some further details about various aspects and embodiments (aggregation network layer, convolution network layer, correlation network layer and normalization) are provided.

Aggregation
Embodiments of the proposed guided aggregation can be applied for guided down-scaling of feature maps. By using input position dependent kernels as the guiding data, input values, which are features of the feature map, are grouped into sub-arrays of the input data array and can be further aggregated in a controlled way, producing an output feature value representative of the whole sub-array. This way, when changing the resolution of the input feature map, object borders and other details that are normally lost during down-scaling can be better preserved. In such cases, the guiding data represents information about object or region borders, obtained by e.g. color-based segmentation, semantic segmentation using preceding neural network layers or an edge map of a texture image corresponding to the processed feature map.

Convolution
Embodiments of the proposed guided convolution can be applied for switchable feature extraction. Input values, which are features of the feature map, are convolved with adaptable feature extraction filters which are formed from the input guiding data in the form of position dependent kernels. This way, each selected area of the input feature map can be processed with feature extraction filters producing only the features desired for these regions. Here, the guiding data in the form of similarity features represents information about object or region borders, obtained by e.g. color-based segmentation, semantic segmentation using preceding neural network layers, an edge map of a texture image corresponding to the processed feature map or a ROI (region of interest) binary map.
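Combining this with the kernel construction of claims 3 and 4 (a position independent filter bank blended per position by position dependent weights), switchable feature extraction can be sketched as follows. Single-channel input, zero padding and all names are assumptions of this illustration.

```python
import numpy as np

def switchable_conv(feat, bank, weights, r):
    """Convolution with per-position kernels blended from a filter bank.

    bank    : (Nf, 2r+1, 2r+1) position independent kernels.
    weights : (H, W, Nf) position dependent weights, e.g. soft region masks
              from a preceding segmentation layer; the effective kernel at
              (x, y) is sum_f weights[x, y, f] * bank[f].
    """
    H, W = feat.shape
    p = np.pad(feat, r)  # zero padding, an assumed border treatment
    out = np.zeros((H, W))
    for x in range(H):
        for y in range(W):
            k = np.tensordot(weights[x, y], bank, axes=1)  # blend the filter bank
            # Reversed window so element (i+r, j+r) is the input at (x - i, y - j).
            win = p[x:x + 2 * r + 1, y:y + 2 * r + 1][::-1, ::-1]
            out[x, y] = float((win * k).sum())
    return out
```

In a region where only one weight is non-zero, the layer thus behaves as an ordinary convolution with the filter selected for that region.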
Correlation
Embodiments of the proposed guided correlation can be applied for guided correlation of input feature maps. By using input position dependent kernels as the guiding data, input values, which are features of the two or more feature maps, are correlated together in a controlled way, enabling amplification or attenuation of certain features within a correlation region. This way, features that correspond to other objects or regions in the feature map can be excluded or taken into account with smaller impact when computing the result. Also, some of the features characteristic of a selected region can be amplified. Here, the guiding data represents information about object or region borders, obtained by e.g. color-based segmentation, semantic segmentation using preceding neural network layers, an edge map of a texture image corresponding to the processed feature map or a ROI (region of interest) binary map.
Normalization
In general, normalization is advantageous if the output values obtained for different spatial positions are to be compared to each other per value, without any intermediate step. In that case, preservation of the mean (DC) component is desirable. If such a comparison is not performed, normalization is not required and only increases complexity. Accordingly, one can omit normalization in order to simplify the computations and compute only an approximate result.
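The DC-preservation argument can be made concrete with a tiny example, assuming the aggregation form out(x, y) = (1/W_L) · Σ in · w from above: for a constant input patch, the normalized output reproduces the constant exactly, while the unnormalized output scales it by the total kernel mass. The kernel values here are purely illustrative.

```python
import numpy as np

# One position dependent kernel and a constant input patch of value 5.0.
k = np.array([[1.0, 2.0, 1.0],
              [2.0, 4.0, 2.0],
              [1.0, 2.0, 1.0]])   # kernel mass k.sum() == 16.0
patch = 5.0 * np.ones((3, 3))

unnormalized = float((patch * k).sum())            # DC component scaled by 16
normalized = float((patch * k).sum() / k.sum())    # DC component preserved
```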
While a particular feature or aspect of the disclosure may have been disclosed with respect to only one of several implementations or embodiments, such feature or aspect may be combined with one or more other features or aspects of the other implementations or embodiments as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "include", "have", "with", or other variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprise". Also, the terms "exemplary", "for example" and "e.g." are merely meant as an example, rather than the best or optimal. The terms "coupled" and "connected", along with their derivatives, may have been used. It should be understood that these terms may have been used to indicate that two elements cooperate or interact with each other regardless of whether they are in direct physical or electrical contact or are not in direct contact with each other.
Although specific aspects have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific aspects shown and described without departing from the scope of the present disclosure. This application is intended to cover any adaptations or variations of the specific aspects discussed herein.
Although the elements in the following claims are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.
Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art readily recognize that there are numerous applications of the invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art recognize that many changes may be made thereto without departing from the scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.

Claims

1. A data processing apparatus (100) comprising: a processor (101) configured to provide a neural network (110), wherein the neural network (110) comprises a neural network layer (120) being configured to generate from an array of input data values (117) an array of output data values (121) based on a plurality of position dependent kernels (118; 119) and a plurality of sub-arrays of the array of input data values (117).
2. The data processing apparatus (100) of claim 1, wherein the neural network (110) comprises an additional neural network layer (115) configured to generate the plurality of position dependent kernels (118) based on an original array of original input data values (111, 117) of the neural network (110), wherein the original array of original input data values (111, 117) of the neural network (110) comprises the array of input data values (117) or another array of input data values (111) associated to the array of input data values (117).
3. The data processing apparatus (100) of claim 1 or 2, wherein the neural network (110) is configured to generate the plurality of position dependent kernels (118) based on a plurality of position independent kernels (119b) and a plurality of position dependent weights (119a).
4. The data processing apparatus (100) of claim 3, wherein the neural network (110) is configured to generate a kernel of the plurality of position dependent kernels (118) by adding the position independent kernels (119b) weighted by the associated position dependent weights (119a).
5. The data processing apparatus (100) of claim 3 or 4, wherein the plurality of position independent kernels (119b) are predetermined or learned and wherein the neural network (110) comprises an additional neural network layer (115) or processing layer (115) configured to generate the plurality of position dependent weights (119a) based on an original array of original input data values (111, 117) of the neural network (110), wherein the original array of original input data values (111, 117) of the neural network (110) comprises the array of input data values (117) or another array of input data values (111) associated to the array of input data values (117).
6. The data processing apparatus (100) of any one of claims 3 to 5, wherein the array of input data values (117) and the array of output data values (121) are two-dimensional arrays and the neural network layer (120) is configured to generate a kernel of the plurality of position dependent kernels (118) on the basis of the following equation:

w_L(x, y, i, j) = Σ_{f=1}^{N_f} F_f(x, y) · K_f(i, j),

wherein F_f(x, y) denotes the plurality of N_f position dependent weights (119a) and K_f denotes the plurality of position independent kernels (119b).
7. The data processing apparatus (100) of any one of the preceding claims, wherein the neural network layer (120) is a convolutional network layer or an aggregation network layer.
8. The data processing apparatus (100) of any one of the preceding claims, wherein the array of input data values (117) and the array of output data values (121) are two-dimensional arrays, wherein the neural network layer (120) is a convolutional network layer configured to generate the array of output data values (121) on the basis of the following equation:

out(x, y, c_o) = (1 / W_L(x, y, c_o)) · Σ_{c_i} Σ_{i=-r}^{r} Σ_{j=-r}^{r} in(x - i, y - j, c_i) · w_L(x, y, c_o, c_i, i, j)

with:

W_L(x, y, c_o) = Σ_{c_i} Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, c_o, c_i, i, j),

wherein out(x, y, c_o) denotes the array of output data values (121), in(x, y, c_i) denotes the array of input data values (117), r denotes a size of each kernel of the plurality of position dependent kernels w_L(x, y, c_o, c_i, i, j) and W_L(x, y, c_o) denotes a normalization factor, or, wherein the neural network layer (120) is an aggregation network layer configured to generate the array of output data values (121) on the basis of the following equation:

out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} in(x - i, y - j) · w_L(x, y, i, j)

with:

W_L(x, y) = Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, i, j),

wherein out(x, y) denotes the array of output data values (121), in(x, y) denotes the array of input data values (117), r denotes a size of each kernel of the plurality of position dependent kernels w_L(x, y, i, j) and W_L(x, y) denotes a normalization factor.
9. The data processing apparatus (100) of any one of claims 1 to 7, wherein the neural network layer (120) is a correlation network layer configured to generate the array of output data values (121) from the array of input data values (117) and a further array of input data values by: correlating the array of input data values (117) with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels (118), or correlating the array of input data values (117) with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels (118) associated to the array of input data values (117) and a further position dependent kernel of a plurality of further position dependent kernels associated to the further array of input data values.
10. The data processing apparatus (100) of claim 9, wherein the array of input data values (117), the further array of input data values and the plurality of position dependent kernels (118) are respective two-dimensional arrays and wherein the correlation neural network layer (120) is configured to generate the array of output data values (121) on the basis of the following equation:

out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} in1(x - i, y - j) · in2(x - i, y - j) · w_L1(x, y, i, j)

with:

W_L(x, y) = Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L1(x, y, i, j),

wherein out(x, y) denotes the array of output data values (121), in1(x, y) denotes the array of input data values (117), in2(x, y) denotes the further array of input data values, r denotes a size of each kernel of the plurality of position dependent kernels w_L1(x, y, i, j) and W_L(x, y) denotes a normalization factor, or

out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} in1(x - i, y - j) · w_L1(x, y, i, j) · in2(x - i, y - j) · w_L2(x, y, i, j)

with:

W_L(x, y) = Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L1(x, y, i, j) · w_L2(x, y, i, j),

wherein out(x, y) denotes the array of output data values (121), in1(x, y) denotes the array of input data values (117), in2(x, y) denotes the further array of input data values, r denotes a size of each kernel of the plurality of position dependent kernels w_L1(x, y, i, j) and of each kernel of the plurality of further position dependent kernels w_L2(x, y, i, j) and W_L(x, y) denotes a normalization factor.
11. The data processing apparatus (100) of any one of claims 1 to 7, wherein the neural network layer (120) is configured to generate a respective output data value of the array of output data values (121) by determining a respective input data value of a respective sub-array of input data values of the plurality of sub-arrays of input data values being associated with a maximum or minimum kernel value of a position dependent kernel and using the respective determined input data value as the respective output data value.
12. A data processing method (1000) comprising:
generating (1001) by a neural network layer (120) of a neural network (110) from an array of input data values (117) an array of output data values (121) based on a plurality of position dependent kernels (118) and a plurality of sub-arrays of the array of input data values (117).
13. A computer program comprising program code for performing the method (1000) of claim 12, when executed on a computer and/or a processor.
PCT/EP2017/057088 2017-03-24 2017-03-24 Neural network data processing apparatus and method WO2018171899A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/EP2017/057088 WO2018171899A1 (en) 2017-03-24 2017-03-24 Neural network data processing apparatus and method
EP17713634.8A EP3590076A1 (en) 2017-03-24 2017-03-24 Neural network data processing apparatus and method
CN201780088904.6A CN110462637B (en) 2017-03-24 2017-03-24 Neural network data processing device and method


Publications (1)

Publication Number Publication Date
WO2018171899A1 true WO2018171899A1 (en) 2018-09-27



Non-Patent Citations (8)

BERT DE BRABANDERE ET AL: "Dynamic Filter Networks", arXiv, 6 June 2016, XP055432972
DI KANG ET AL: "Crowd Counting by Adapting Convolutional Neural Networks with Side Information", arXiv, 21 November 2016, XP080732861
J. LONG; E. SHELHAMER; T. DARRELL: "Fully Convolutional Networks for Semantic Segmentation", CVPR, 2015
JAMPANI VARUN ET AL: "Learning Sparse High Dimensional Filters: Image Filtering, Dense CRFs and Bilateral Neural Networks", CVPR 2016, pages 4452-4461, XP033021632, DOI: 10.1109/CVPR.2016.482
KAIMING HE; JIAN SUN; XIAOOU TANG: "Guided Image Filtering", ECCV, 2010
LI YIJUN ET AL: "Deep Joint Image Filtering", 17 September 2016, Lecture Notes in Computer Science, Springer International Publishing, pages 154-169, ISBN: 978-3-540-28012-5, XP047355387
M. ELAD: "On the origin of bilateral filter and ways to improve it", IEEE Transactions on Image Processing, vol. 11, no. 10, October 2002, pages 1141-1151
SIMON NIKLAUS ET AL: "Video Frame Interpolation via Adaptive Convolution", arXiv, 22 March 2017, XP080758826



