WO2018171899A1 - Neural network data processing apparatus and method - Google Patents

Neural network data processing apparatus and method

Info

Publication number
WO2018171899A1
Authority
WO
WIPO (PCT)
Prior art keywords
array
data values
input data
neural network
position dependent
Prior art date
Application number
PCT/EP2017/057088
Other languages
French (fr)
Inventor
Jacek Konieczny
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2017/057088
Priority to EP17713634.8 (EP3590076A1)
Priority to CN201780088904.6 (CN110462637B)
Publication of WO2018171899A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • the present invention relates to the field of machine learning or deep learning based on neural networks. More specifically, the present invention relates to a neural network data processing apparatus and method, in particular for processing data in the fields of audio processing, computer vision, image or video processing, classification, detection and/or recognition.
  • BACKGROUND Weighted aggregation, which is commonly used in many signal processing applications, such as image processing methods for image quality improvement, depth or disparity estimation and many other applications [Kaiming He, Jian Sun, Xiaoou Tang, "Guided Image Filtering", ECCV 2010], is a process in which input data are combined to pack information present in a larger spatial area into one single spatial position, with additional input in the form of aggregation weights that control the influence of each input data value on the result.
  • In deep learning, a common approach recently used in many application fields is the utilization of convolutional neural networks. Generally, a specific part of such convolutional neural networks is at least one convolution (or convolutional) layer, which performs a convolution of input data values with a learned kernel K, producing one output data value per convolution kernel for each output position [J. Long, E. Shelhamer, T. Darrell, "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015].
  • the convolution using the learned kernel K can be expressed mathematically as follows:

    out(x, y) = Σ_{i=-r}^{r} Σ_{j=-r}^{r} K(i, j) · in(x - i, y - j) + B,

    wherein out(x, y) denotes the array of output data values, in(x - i, y - j) denotes a sub-array of an array of input data values and K(i, j) denotes the kernel comprising an array of kernel weights or kernel values of size (2r+1)×(2r+1). B denotes an optional learned bias term, which can be added for obtaining each output data value.
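  • Purely for illustration (this sketch is not part of the patent text), the convolution above can be expressed in Python with NumPy; the function name and the border handling (only positions with a full neighbourhood are computed) are choices of this example:

```python
import numpy as np

def conv2d(inp, K, B=0.0):
    """Position-independent convolution:
    out(x, y) = sum_{i,j=-r..r} K(i, j) * in(x - i, y - j) + B.
    Border positions without a full (2r+1)x(2r+1) neighbourhood are
    skipped, so the output is smaller than the input."""
    r = K.shape[0] // 2
    H, W = inp.shape
    out = np.zeros((H - 2 * r, W - 2 * r))
    for x in range(r, H - r):
        for y in range(r, W - r):
            acc = 0.0
            for i in range(-r, r + 1):
                for j in range(-r, r + 1):
                    acc += K[i + r, j + r] * inp[x - i, y - j]
            out[x - r, y - r] = acc + B
    return out
```

  • With a 3×3 all-ones kernel, every output value is simply the sum over the corresponding 3×3 neighbourhood plus the bias B.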
  • the weights of the kernel K are the same for the whole array of input data values in(x, y) and are generally learned during a learning phase of the neural network, which, in the case of 1st-order methods, consists of iteratively back-propagating the gradients of the neural network output back to the input layers and updating the weights of all the network layers by the partial derivatives computed in this way.
  • embodiments of the invention provide a new approach for weighted aggregation of data within a neural network.
  • the neural network layer can compute aggregated data using individual aggregation weights that are learned for each individual spatial position.
  • Aggregation weights can be computed as a function of similarity features and learned weight kernels, resulting in individual aggregation weights for each output spatial position.
  • the invention relates to a data processing apparatus comprising one or more processors configured to provide a neural network.
  • the data to be processed by the data processing apparatus can be, for instance, two-dimensional image or video data or one-dimensional audio data.
  • the neural network provided by the one or more processors of the data processing apparatus comprises a neural network layer being configured to process an array of input data values, such as a two-dimensional array of input data values in(x, y), into an array of output data values, such as a two-dimensional array of output data values out(x, y).
  • the neural network layer can be a first layer or an intermediate layer of the neural network.
  • the array of input data values can be one-dimensional (i.e. a vector, e.g. audio or other e.g. temporal sequence), two-dimensional (i.e. a matrix, e.g. an image or other temporal or spatial sequence), or N-dimensional (e.g. any kind of N-dimensional feature array, e.g. provided by a conventional pre-processing or feature extraction and/or by other layers of the neural network).
  • the array of input data values can have one or more channels, e.g. for an RGB image one R-channel, one G-channel and one B-channel, or for a black/white image only one grey-scale or intensity channel.
  • channels can refer to any "feature".
  • the array of input data values can comprise, for instance, two-dimensional RGB or grey scale image or video data representing at least a part of an image, or a one-dimensional audio signal.
  • the array of input data values can be, for instance, any kind of array of features generated by previous layers of the neural network on the basis of an initial, e.g. original array of input data values, e.g. by means of a feature extraction.
  • the neural network layer is configured to generate from the array of input data values the array of output data values on the basis of a plurality of position dependent, i.e. spatially variable kernels and a plurality of different sub-arrays at different positions of the array of input data values.
  • Each kernel comprises a plurality of kernel values or kernel weights.
  • a respective kernel is applied to a respective sub-array of the array of input data values to generate a single output data value.
  • a "position dependent kernel” as used herein means a kernel whose kernel weights depend on the respective position, e.g. (x,y) for two-dimensional arrays, of the sub-array of input data values.
  • the kernel values applied to a first sub-array of the array of input data values can differ from the kernel values of a second kernel applied to a second sub-array of the array of input data values.
  • the position could be a spatial position defined, for instance, by two spatial coordinates.
  • In a one-dimensional array the position could be a temporal position defined, for instance, by a time coordinate.
  • the data processing apparatus allows aggregating the input data in a way that better reflects mutual data similarity, i.e. the resultant output data value is more strongly influenced by input data values that are closer and more similar to the input data at the center position of the kernel. Moreover, the data processing apparatus allows adapting the kernel weights for different spatial positions of the array of input data values. This, in turn, allows, for instance, minimizing the influence of some of the input data values on the result, for instance the input data values that are associated with another part of the scene (as determined by semantic segmentation) or a different object that is being analysed.
  • the neural network comprises at least one additional network layer configured to generate the plurality of position dependent kernels on the basis of an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values.
  • the original array of original input values can be the array of input data values or a different array.
  • the neural network is configured to generate the plurality of position dependent kernels based on a plurality of learned position independent kernels and a plurality of position dependent weights.
  • the position independent kernels can be learned by the neural network and the position dependent weights or similarity features can be computed, for instance, by a further preceding layer of the neural network.
  • This implementation form allows minimizing the amount of data being transferred to the neural network layer in order to obtain the kernel values. This is because the kernel values are not transferred directly, but computed from the plurality of position dependent weights and/or similarity features, substantially reducing the amount of data for each element of the array of output data values. This can minimize the amount of data being stored and transferred by the neural network between the different network layers, which is especially important during the learning process on the basis of the mini-batch approach, as the memory of the data processing apparatus (GPU) is currently the main bottleneck.
  • the neural network is configured to generate a kernel of the plurality of position dependent kernels by adding the learned position independent kernels each weighted by the associated non-learned position dependent weights (i.e. similarity features).
  • This implementation form provides a very efficient representation of the plurality of position dependent kernels using a linear combination of position independent "base kernels”.
  • the plurality of position independent kernels are predetermined or learned, and wherein the neural network comprises at least one additional neural network layer or "conventional" pre-processing layer configured to generate the plurality of position dependent weights (i.e. similarity features) based on an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values.
  • the original array of original input values can be the array of input data values or a different array.
  • the at least one additional neural network layer or "conventional" pre-processing layer can generate the plurality of position dependent weights (i.e. similarity features) using, for instance, bilateral filtering, semantic segmentation, per-instance object detection, and data importance indicators like ROI (region of interest).
  • the array of input data values and the array of output data values are two-dimensional arrays
  • the convolutional neural network layer is configured to generate the plurality of position dependent kernels w_L(x, y, i, j) on the basis of the following equation:

    w_L(x, y, i, j) = Σ_{f=1}^{Nf} F_f(x, y) · K_f(i, j),

    wherein F_f(x, y) denotes the set of Nf position dependent weights (i.e. similarity features) and K_f(i, j) denotes the plurality of position independent kernels.
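  • A minimal sketch of this linear combination in Python/NumPy (the function name and the array layouts are assumptions of this example, not taken from the description): each position (x, y) receives its own kernel as the similarity-feature-weighted sum of the Nf position-independent base kernels.

```python
import numpy as np

def position_dependent_kernels(F, K_base):
    """w_L(x, y, i, j) = sum_f F_f(x, y) * K_f(i, j).
    F: (Nf, H, W) position dependent weights (similarity features).
    K_base: (Nf, k, k) position independent base kernels.
    Returns an (H, W, k, k) array: one kernel per spatial position."""
    return np.einsum('fxy,fij->xyij', F, K_base)
```

  • Only the Nf feature values per position need to be stored instead of a full k×k kernel, which is the data-reduction argument made above.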
  • the neural network layer is a convolutional network layer or an aggregation network layer.
  • the array of input data values and the array of output data values are two-dimensional arrays, wherein the array of input data values in(x, y, c_i) has C_i different channels and wherein the neural network layer is a convolutional network layer configured to generate the array of output data values out(x, y, c_o) on the basis of the following equation:

    out(x, y, c_o) = (1 / W_L(x, y)) · Σ_{c_i=1}^{C_i} Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, c_o, c_i, i, j) · in(x - i, y - j, c_i),

  • wherein r denotes a size of each kernel of the plurality of position dependent kernels w_L(x, y, c_o, c_i, i, j) and W_L(x, y) denotes a normalization factor.
  • the normalization factor can be set equal to 1.
  • the array of input data values and the array of output data values are two-dimensional arrays, wherein the array of input data values in(x, y) has only a single channel and wherein the neural network layer is an aggregation network layer configured to generate the array of output data values out(x, y) on the basis of the following equation:

    out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, i, j) · in(x - i, y - j),

  • wherein r denotes a size of each kernel of the plurality of position dependent kernels w_L(x, y, i, j) and W_L(x, y) denotes a normalization factor.
  • W_L(x, y) can be set equal to 1.
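  • Under the assumption that W_L(x, y) is chosen as the sum of the kernel weights (setting it to 1 disables the normalization, as noted above), the single-channel aggregation layer can be sketched as follows; function name and border handling are choices of this illustration:

```python
import numpy as np

def weighted_aggregation(inp, w):
    """out(x, y) = (1 / W_L(x, y)) * sum_{i,j} w_L(x, y, i, j) * in(x - i, y - j).
    inp: (H, W) single-channel input; w: (H, W, 2r+1, 2r+1) position
    dependent kernels. Border positions are left at zero for brevity."""
    r = w.shape[2] // 2
    H, W = inp.shape
    out = np.zeros_like(inp, dtype=float)
    for x in range(r, H - r):
        for y in range(r, W - r):
            k = w[x, y]
            # flip the patch so patch[i+r, j+r] corresponds to in(x - i, y - j)
            patch = inp[x - r:x + r + 1, y - r:y + r + 1][::-1, ::-1]
            norm = k.sum()
            norm = norm if norm != 0 else 1.0
            out[x, y] = (k * patch).sum() / norm
    return out
```

  • With the normalization in place, a constant input is reproduced exactly at every interior position, which is the property that makes aggregated matching costs comparable across positions.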
  • the neural network layer is a correlation network layer configured to generate the array of output data values from the array of input data values and a further array of input data values by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels, or by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels associated to the array of input data values and a further position dependent kernel of a plurality of further position dependent kernels associated to the further array of input data values.
  • the array of input data values in1(x, y), the further array of input data values in2(x, y) and the plurality of position dependent kernels w_L1(x, y, i, j) are two-dimensional arrays and wherein the correlation neural network layer is configured to generate the array of output data values out(x, y) on the basis of the following equation:

    out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L1(x, y, i, j) · in1(x - i, y - j) · in2(x - i, y - j),

  • wherein W_L(x, y) denotes a normalization factor.
  • W_L(x, y) can be set equal to 1.
  • the array of input data values in1(x, y), the further array of input data values in2(x, y), the plurality of position dependent kernels w_L1(x, y, i, j) and the plurality of further position dependent kernels w_L2(x, y, i, j) are two-dimensional arrays and wherein the correlation neural network layer is configured to generate the array of output data values out(x, y) on the basis of the following equation:

    out(x, y) = (1 / W_L12(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L1(x, y, i, j) · in1(x - i, y - j) · w_L2(x, y, i, j) · in2(x - i, y - j),

  • wherein r denotes a size of each kernel of the plurality of position dependent kernels w_L1(x, y, i, j) and of each kernel of the plurality of further position dependent kernels w_L2(x, y, i, j) and W_L12(x, y) denotes a normalization factor.
  • the normalization factor can be set equal to 1.
  • the neural network layer is configured to generate a respective output data value of the array of output data values by determining a respective input data value of a respective sub-array of input data values of the plurality of sub-arrays of input data values being associated with a maximum or minimum kernel value of a position dependent kernel and using the respective determined input data value as the respective output data value.
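  • One way to read this implementation form (a sketch under the assumption that the maximum is used; the minimum case is symmetric): instead of summing, the layer routes through the single input value at the offset where the position dependent kernel is largest.

```python
import numpy as np

def select_by_max_weight(inp, w):
    """For each interior position, output the input value of the sub-array
    that is associated with the maximum kernel value of the position
    dependent kernel w[x, y] (w has shape (H, W, 2r+1, 2r+1))."""
    r = w.shape[2] // 2
    H, W = inp.shape
    out = np.zeros_like(inp, dtype=float)
    for x in range(r, H - r):
        for y in range(r, W - r):
            # locate the (i, j) offset of the largest kernel weight
            i, j = np.unravel_index(np.argmax(w[x, y]), w[x, y].shape)
            out[x, y] = inp[x + i - r, y + j - r]
    return out
```

  • This behaves like a max-pooling variant in which the pooling choice is driven by the learned, position dependent weights rather than by the input values themselves.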
  • the invention relates to a corresponding data processing method comprising the step of generating by a neural network layer of a neural network from an array of input data values an array of output data values based on a plurality of position dependent kernels and a plurality of sub-arrays of the array of input values.
  • the method comprises the further step of generating the position dependent kernel of the plurality of position dependent kernels by an additional neural network layer of the neural network based on an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values.
  • the step of generating the position dependent kernel of the plurality of position dependent kernels comprises generating the position dependent kernel of the plurality of position dependent kernels based on a plurality of position independent kernels and a plurality of position dependent weights.
  • the step of generating a kernel of the plurality of position dependent kernels comprises the step of adding, i.e. summing, the position independent kernels weighted by the associated position dependent weights.
  • the step of generating the plurality of position dependent weights comprises the step of generating the plurality of position dependent weights by an additional neural network layer or a processing layer of the neural network based on an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values.
  • the array of input data values and the array of output data values are two-dimensional arrays, and the step of generating a kernel of the plurality of position dependent kernels is based on the following equation:

    w_L(x, y, i, j) = Σ_{f=1}^{Nf} F_f(x, y) · K_f(i, j),

  • wherein F_f(x, y) denotes the plurality of Nf position dependent weights (i.e. similarity features) and K_f(i, j) denotes the plurality of position independent kernels.
  • the neural network layer is a convolutional network layer or an aggregation network layer.
  • the array of input data values and the array of output data values are two-dimensional arrays and the neural network layer is a convolutional network layer, wherein the step of generating the array of output data values is based on the following equation:

    out(x, y, c_o) = (1 / W_L(x, y)) · Σ_{c_i=1}^{C_i} Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, c_o, c_i, i, j) · in(x - i, y - j, c_i).
  • the neural network layer is an aggregation network layer and wherein the step of generating the array of output data values is based on the following equation:

    out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, i, j) · in(x - i, y - j).

  • the normalization factors can be set equal to 1.
  • the neural network layer is a correlation network layer and the step of generating the array of output data values comprises generating the array of output data values from the array of input data values and a further array of input data values (a) by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels, or (b) by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels associated to the array of input data values and a further position dependent kernel of a plurality of further position dependent kernels associated to the further array of input data values.
  • the array of input data values, the further array of input data values and the kernel are two-dimensional arrays and the step of generating the array of output data values by the correlation neural network layer is based on the following equations:

    out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L1(x, y, i, j) · in1(x - i, y - j) · in2(x - i, y - j), or

    out(x, y) = (1 / W_L12(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L1(x, y, i, j) · in1(x - i, y - j) · w_L2(x, y, i, j) · in2(x - i, y - j).

  • the normalization factors W_L or W_L12 can be set equal to 1.
  • the step of generating an output data value of the array of output data values by the convolutional neural network layer comprises the steps of determining an input data value of a sub-array of input data values of the plurality of sub-arrays of input data values being associated with a maximum or minimum kernel value of a position dependent kernel and using the determined input data value as the output data value.
  • the invention relates to a computer program comprising program code for performing the method according to the second aspect, when executed on a processor or a computer.
  • the invention can be implemented in hardware and/or software or in any combination thereof.
  • Fig. 1 shows a schematic diagram illustrating a data processing apparatus based on a neural network according to an embodiment
  • Fig. 2 shows a schematic diagram illustrating a neural network provided by a data processing apparatus according to an embodiment
  • Fig. 3 shows a schematic diagram illustrating the concept of down-stepping or aggregation of data implemented in a data processing apparatus according to an embodiment
  • Fig. 4 shows a schematic diagram illustrating different aspects of a neural network provided by a data processing apparatus according to an embodiment
  • Fig. 5 shows a schematic diagram illustrating different aspects of a neural network provided by a data processing apparatus according to an embodiment
  • Fig. 6 shows a schematic diagram illustrating different processing steps of a data processing apparatus according to an embodiment
  • Fig. 7 shows a schematic diagram illustrating a neural network provided by a data processing apparatus according to an embodiment
  • Fig. 8 shows a schematic diagram illustrating different aspects of a neural network provided by a data processing apparatus according to an embodiment
  • Fig. 9 shows a schematic diagram illustrating different processing steps of a data processing apparatus according to an embodiment
  • Fig. 10 shows a flow diagram illustrating a neural network data processing method according to an embodiment.
  • Figure 1 shows a schematic diagram illustrating a data processing apparatus 100 according to an embodiment configured to process data on the basis of a neural network.
  • the data processing apparatus 100 shown in figure 1 comprises a processor 101.
  • the data processing apparatus 100 can be implemented as a distributed data processing apparatus 100 comprising more than the one processor 101 shown in figure 1.
  • the processor 101 of the data processing apparatus 100 is configured to provide a neural network 110.
  • the neural network 110 comprises a neural network layer being configured to generate from an array of input data values an array of output data values based on a plurality of sub-arrays of the array of input data values and a plurality of position dependent kernels comprising a plurality of kernel values or kernel weights.
  • the data processing apparatus 100 can further comprise a memory 103 for storing and/or retrieving the input data values, the output data values and/or the kernel values.
  • Each position dependent kernel comprises a plurality of kernel values or kernel weights. For a respective position or element of the array of input data values a respective kernel is applied to a respective sub-array of the array of input data values to generate a single output data value.
  • a "position dependent kernel” as used herein means a kernel whose kernel weights depend on the respective position of the sub-array of the array of input data values to which the kernel is applied. In other words, for a first kernel applied to a first sub-array of the plurality of input data values the kernel values can differ from the kernel values of a second kernel applied to a second sub-array of the plurality of input data values forming a different sub-array of the same array of input values.
  • the position could be a spatial position defined, for instance, by two spatial coordinates (x,y).
  • the position could be a temporal position defined, for instance, by a time coordinate (t).
  • the array of input data values can be one-dimensional (i.e. a vector, e.g. audio or other e.g. temporal sequence), two-dimensional (i.e. a matrix, e.g. an image or other temporal or spatial sequence), or N-dimensional (e.g. any kind of N-dimensional feature array, e.g. provided by a conventional pre-processing or feature extraction and/or by other layers of the neural network 110).
  • the array of input data values can have one or more channels, e.g. for an RGB image one R-channel, one G-channel and one B-channel, or for a black/white image only one grey-scale or intensity channel.
  • channel can refer to any "feature".
  • the array of input data values can comprise, for instance, two-dimensional RGB or grey scale image or video data representing at least a part of an image, or a one-dimensional audio signal.
  • the array of input data values can be, for instance, an array of similarity features generated by previous layers of the neural network on the basis of an initial, e.g. original array of input data values, e.g. by means of a feature extraction, as will be described in more detail further below.
  • the neural network layer 120 can be implemented as a convolution (or convolutional) layer configured to "mix" all channels of the array of input data values, for instance, in case the array of input data values is an RGB image, i.e. an image having three channels.
  • the position dependent kernels may be channel-specific or common for all channels.
  • the position dependent kernels are generally multi-channel kernels.
  • the neural network layer can be implemented as a correlation layer providing a combination of aggregation or convolution (input image and weighted kernel) and an additional image (i.e. correlation of two, e.g. of the same or at the same position, sub-arrays in the two images with each other and additional application of the position dependent kernel on the correlation result).
  • the position dependent kernels may be channel-specific or common for all channels.
  • FIG. 2 shows a schematic diagram illustrating elements of the neural network 110 provided by the data processing apparatus 100 according to an embodiment.
  • the neural network layer 120 is implemented as a weighted aggregation layer 120.
  • the neural network layer 120 can be implemented as a convolution network layer 120 (also referred to as convolutional network layer 120) or as a correlation network layer 120, as will be described in more detail further below.
  • the aggregation layer 120 is configured to generate a two-dimensional array of output data values out(x, y) 121 on the basis of a respective sub-array of the two-dimensional array of input data values in(x, y) 117 and a plurality of position dependent kernels 118 comprising a plurality of kernel values or kernel weights.
  • the weighted aggregation layer 120 of the neural network 110 shown in figure 2 is configured to generate the array of output data values out(x, y) 121 on the basis of the plurality of sub-arrays of the two-dimensional array of input data values in(x, y) 117 and the plurality of position dependent kernels 118 comprising the kernel values w_L(x, y, i, j) using the following equation:

    out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, i, j) · in(x - i, y - j),

    wherein r denotes a size of each kernel of the plurality of position dependent kernels 118 (in this example, each kernel and each sub-array of the array of input values has a size of (2r+1)×(2r+1)) and W_L(x, y) denotes a normalization factor.
  • the normalization factor can be omitted, i.e. set to one.
  • the normalization factor allows keeping the mean value or DC component. This can be advantageous when the weighted aggregation layer 120 is used to aggregate stereo matching costs of a stereo image, because the normalization is beneficial for making the output values for different sub-arrays of the array of input data values comparable. This is usually not necessary in the case of the convolutional network layer 120.
  • the above equations for a two-dimensional input array and a kernel having a quadratic shape can be easily adapted to the case of an array of input values 117 having one dimension or more than two dimensions and/or a kernel having a rectangular shape, e.g. a non-square rectangular shape with different horizontal and vertical dimensions.
  • the neural network layer 120 can be configured to generate the array of output data values out(x, y, c_o) 121 having one or more channels on the basis of the plurality of sub-arrays of the two-dimensional array of input data values in(x, y, c_i) 117 in the different channels and the plurality of position dependent kernels 118 comprising the kernel values w_L(x, y, c_o, c_i, i, j) using the following equation:

    out(x, y, c_o) = (1 / W_L(x, y)) · Σ_{c_i=1}^{C_i} Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, c_o, c_i, i, j) · in(x - i, y - j, c_i),

    wherein C_i denotes the number of channels of the array of input data values 117.
  • the neural network layer 120 is configured to generate the array of output data values 121 with a smaller size than the array of input data values 117.
  • the neural network 110 is configured to perform a down-step operation on the basis of the plurality of position dependent kernels 118.
  • Figure 3 illustrates a down-step operation provided by the neural network 110 of the data processing apparatus 100 according to an embodiment.
  • Using a down-step operation allows increasing the receptive field, enables processing the data with a cascade of smaller filters as compared with a single layer with a kernel covering an equal receptive field, and also enables the neural network 110 to better analyze the data by finding more sophisticated relationships among the data and adding more non-linearities to the processing chain by separating each convolution layer with a non-linear element like a sigmoid or a Rectified Linear Unit (ReLU).
  • the neural network layer 120 can combine the input data values to produce the array of output data values with a reduced resolution. This can be achieved by convolving the array of input data values 117 with the position dependent kernels 118 with a stride S greater than 1.
  • the stride S specifies the spacing between neighboring input spatial positions for which convolutions are computed. If the stride S is equal to 1, the convolution is performed for each spatial position. If the stride S is greater than 1, the neural network layer 120 is configured to perform a convolution for every S-th spatial position of the array of input data values 117, thereby reducing the output resolution (i.e. the dimensions of the array of output data values 121) by a factor of S for each spatial dimension.
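  • The effect of the stride on the sampled positions can be illustrated as follows (the helper name and the "valid" border convention, i.e. evaluating only positions with a full neighbourhood, are assumptions of this sketch):

```python
def strided_positions(H, W, r, S):
    """Spatial positions at which a (2r+1)x(2r+1) kernel is evaluated on
    an H x W input with stride S: every S-th valid position per dimension,
    so each output dimension shrinks roughly by a factor of S."""
    return [(x, y)
            for x in range(r, H - r, S)
            for y in range(r, W - r, S)]
```

  • For an 8×8 input with r = 1, stride 1 yields 36 positions, while stride 2 yields only 9.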
  • the horizontal and the vertical stride may be the same or different.
  • the neural network layer 120 combines the array of input data values 117 from the spatial area of size (2r+1)×(2r+1) to produce a respective output data value of the array of output data values 121.
  • the input data values 117 can be aggregated to pack information present in a larger spatial area into one single spatial position.
  • the neural network 110 comprises one or more preceding layers 115 preceding the neural network layer 120 and one or more following layers 125 following the neural network layer 120.
  • the neural network layer 120 could be the first and/or the last data processing layer of the neural network 110, i.e. in an embodiment there could be no preceding layers 115 and/or no following layers 125.
  • the one or more preceding layers 115 can be further neural network layers and/or "conventional" pre-processing layers, such as a feature extraction layer.
  • the one or more following layers 125 can be further neural network layers, such as a deconvolutional layer, and/or "conventional" post-processing layers.
  • one or more of the preceding layers 115 can be configured to provide, i.e. to generate the plurality of position dependent kernels 118 (see the bottom signal path of the preceding layers 115 from guiding data 113 to the position dependent kernels 118 in Fig. 2).
  • the one or more layers of the preceding layers 115 can generate the plurality of position dependent kernels 118 on the basis of an original array 111 of original input data values, e.g. an original image as a 2D example.
  • the original array 111 of original input data values can be an array of input data 111 being the original input of the neural network 110.
  • the one or more preceding layers 115 could be configured to generate just the plurality of position dependent kernels 118 on the basis of the original input data 111 of the neural network 110 and to provide the original input data 111 of the neural network 110 as the array of input data values 117 to the neural network layer 120 (no preceding layers in the top signal path of the preceding layers 115 from the original input array 111 to the input array 117 according to an embodiment of the neural network layer 120, see Fig. 2).
  • the original array 111 may form the input array 117.
  • the one or more preceding layers 1 15 of the neural network 1 10 are configured to generate the plurality of position dependent kernels 1 18 on the basis of an array of guiding data 1 13.
  • a more detailed view of the processing steps of the neural network 110 of the data processing apparatus 100 according to such an embodiment is shown in figure 4 for the exemplary case of two-dimensional input and output arrays.
  • the array of guiding data 113 is used by the one or more preceding layers 115 of the neural network 110 to generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113.
  • the neural network layer 120 is configured to generate the two-dimensional array of output data values out(x, y) 121 on the basis of the two-dimensional array of input data values in(x, y) 117 and the plurality of position dependent kernels wL(x, y) 118, which, in turn, are based on the array of guiding data g(x, y) 113.
  • the one or more preceding layers 115 of the neural network 110 are neural network layers configured to learn the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113.
  • the one or more preceding layers 115 of the neural network 110 are pre-processing layers configured to generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data 113 using one or more pre-processing schemes, such as feature extraction.
  • the one or more preceding layers 115 of the neural network 110 are configured to generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113 in a way analogous to bilateral filtering, as illustrated in figure 5.
  • Bilateral filtering is known in the field of image processing for performing a weighted aggregation of input data, while decreasing the influence of some input values and amplifying the influence of other input values on the aggregation result [M. Elad, "On the origin of bilateral filter and ways to improve it", IEEE Transactions on Image Processing, vol. 11, no. 10, pp. 1141-1151, October 2002].
  • the weights 518 utilized for aggregating the array of input data values 517 adapt to the input data 517 using the guiding image data g 513, which provides additional information to control the aggregation process.
  • the array of guiding image data 513 can be equal to the array of input data values 517 used for generating the array of output data values 521 by the layer 520 on the basis of the weights 518.
  • the bilateral filter weights 518 take into consideration the distance of the value within the kernel from the center of the kernel and, additionally, the similarity of the data values with the data in the center of the kernel, as mathematically described by the following equation:

    out(x, y) = 1/W(x, y) · Σ_{i=-r..r} Σ_{j=-r..r} w(x, y, i, j) · in(x − i, y − j),

    wherein the normalization factor W(x, y) is based on the following equation:

    W(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} w(x, y, i, j).

  • the bilateral filter weights 518 are defined by the following equation:

    w(x, y, i, j) = exp(−(i² + j²)/(2σs²) − d(g(x, y), g(x − i, y − j))²/(2σr²)),

    wherein d(.,.) denotes a distance function.
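The bilateral weighting described above can be sketched as follows. This is an illustrative example, not part of the embodiment: the function name, the step-edge guiding image and the Gaussian spatial/range terms with parameters sigma_s and sigma_r are assumptions chosen to match the usual bilateral filter formulation.

```python
import numpy as np

def bilateral_weights(g, x, y, r=1, sigma_s=1.0, sigma_r=0.1):
    """Position dependent weights for the kernel centered at (x, y).

    g       -- 2D guiding image (hypothetical example data)
    r       -- kernel radius; the kernel size is (2r+1) x (2r+1)
    sigma_s -- spatial fall-off (distance from the kernel center)
    sigma_r -- range fall-off (similarity to the center value)
    """
    w = np.zeros((2 * r + 1, 2 * r + 1))
    for i in range(-r, r + 1):
        for j in range(-r, r + 1):
            spatial = (i * i + j * j) / (2.0 * sigma_s ** 2)
            rng = (g[x - i, y - j] - g[x, y]) ** 2 / (2.0 * sigma_r ** 2)
            w[i + r, j + r] = np.exp(-(spatial + rng))
    return w / w.sum()   # normalization factor W(x, y)

g = np.ones((5, 5))
g[:, 3:] = 0.0                       # a step edge in the guiding image
w = bilateral_weights(g, 2, 2, r=1)  # kernel straddling the edge
```

Neighbors on the far side of the edge receive near-zero weight, so the aggregation is effectively restricted to the same region as the center value.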
  • Figure 6 shows a schematic diagram highlighting the main processing stage 601 of the data processing apparatus 100 according to an embodiment, for instance, the data processing apparatus 100 providing the neural network 110 shown in figure 2.
  • the neural network 110 in a first processing step 603 can generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113.
  • in a second processing step 605 the neural network 110 can generate the array of output data values out(x, y) 121 on the basis of the array of input data values in(x, y) 117 and the plurality of position dependent kernels wL(x, y) 118.
  • Figure 7 shows a schematic diagram illustrating the neural network 110 provided by the data processing apparatus 100 according to a further embodiment.
  • the neural network 110 is configured to generate the plurality of position dependent kernels based on a plurality of position independent kernels 119b (shown in figure 8) and a plurality of position dependent weights Ff(x, y) 119a (also referred to as similarity features 119a).
  • the similarity features 119a are obtained based on the guiding data 113 and could indicate higher-level knowledge about the input data 111, including e.g. information about object or region borders.
  • the neural network 110 of figure 7 is configured to generate the plurality of position dependent kernels 118 by adding the position independent kernels 119b, each weighted by the associated position dependent weights Ff(x, y) 119a.
  • any one of the plurality of position independent kernels 119b can be a predetermined or a learned kernel.
  • the neural network 110 can comprise one or more preceding layers 115, which precede the neural network layer 120 and which can be implemented as an additional neural network layer or a pre-processing layer.
  • one or more layers of the preceding layers 115 are configured to generate the plurality of position dependent weights Ff(x, y) 119a on the basis of an original array of original input data values or the guiding data 113.
  • the original array of original input data values of the neural network 110 can comprise the array of input data values 117 to be processed by the neural network layer 120 or another array of input data values 111 associated to the array of input data values 117, for instance, the initial or original array of input data 111.
  • the array of input data values in(x, y) 117 and the array of output data values out(x, y) 121 are two-dimensional arrays and the neural network layer 120 is configured to generate a respective kernel of the plurality of position dependent kernels wL(x, y, i, j) 118 on the basis of the following equation:

    wL(x, y, i, j) = Σ_{f=1..Nf} Ff(x, y) · Kf(i, j),

    wherein Ff(x, y) denotes the set of Nf position dependent weights (or similarity features) 119a and Kf denotes the plurality of position independent kernels 119b, as also illustrated in figure 8.
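The composition of position dependent kernels from similarity features Ff(x, y) and position independent base kernels Kf can be sketched as below. This is a minimal numpy illustration; the shapes, the random example data and the function name are hypothetical.

```python
import numpy as np

def position_dependent_kernels(F, K):
    """Combine N_f position independent kernels K_f with position
    dependent weights F_f(x, y) into one kernel per position.

    F -- similarity features, shape (N_f, H, W)
    K -- position independent kernels, shape (N_f, 2r+1, 2r+1)
    Returns w_L with shape (H, W, 2r+1, 2r+1).
    """
    # w_L(x, y, i, j) = sum_f F_f(x, y) * K_f(i, j)
    return np.einsum('fxy,fij->xyij', F, K)

rng = np.random.default_rng(0)
F = rng.random((4, 8, 8))   # N_f = 4 similarity features on an 8x8 grid
K = rng.random((4, 3, 3))   # four learned 3x3 base kernels
wL = position_dependent_kernels(F, K)
```

Only the Nf feature maps and the Nf base kernels need to be stored, rather than a full (2r+1)×(2r+1) kernel per output position.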
  • Figure 9 shows a schematic diagram highlighting the main processing stage 901 of the data processing apparatus 100 according to an embodiment, for instance, the data processing apparatus 100 providing the neural network 110 illustrated in figures 7 and 8.
  • the neural network 110 in a first processing step 903 can generate the plurality of position dependent weights or similarity features Ff(x, y) 119a on the basis of the array of guiding data g(x, y) 113.
  • the neural network 110 can generate the plurality of position dependent kernels wL(x, y, i, j) 118 on the basis of the plurality of position dependent weights or similarity features Ff(x, y) 119a and the plurality of position independent kernels Kf 119b.
  • the neural network layer 120 can generate the array of output data values out(x, y) 121 on the basis of the array of input data values in(x, y) 117 and the plurality of position dependent kernels wL(x, y, i, j) 118.
  • the neural network layer 120 of the neural network 110 can be implemented in the form of a correlation network layer 120 configured to generate the array of output data values 121 from the array of input data values 117 and a further array of input data values by correlating the array of input data values 117 with the further array of input data values and by applying a respective position dependent kernel of the plurality of position dependent kernels 118 to a respective sub-array of the array of input data values 117 and a corresponding sub-array of the further array of input data values.
  • the correlation neural network layer 120 can be configured to generate the array of output data values out(x, y) 121 on the basis of the following equation:

    out(x, y) = 1/WL1(x, y) · Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j) · in1(x − i, y − j) · in2(x − i, y − j),

    wherein in1(x − i, y − j) denotes the array of input data values 117, in2(x − i, y − j) denotes the further array of input data values, wL1(x, y, i, j) denotes the plurality of position dependent kernels 118 and r denotes the size of each kernel of the plurality of position dependent kernels 118 (in this example, each kernel has (2r+1)·(2r+1) kernel values).
  • the output data values 121 can be normalized using the following normalization factor:

    WL1(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j).

  • the normalization factor can be omitted, i.e. set to one.
  • the above equations for a two-dimensional input array and a kernel having a quadratic shape can easily be adapted to the case of an array of input values 117 having one dimension or more than two dimensions and/or a kernel having a non-square rectangular shape, i.e. different horizontal and vertical dimensions.
  • the correlation network layer 120 is configured to generate the array of output data values 121 from the array of input data values 117 and the further array of input data values by correlating the array of input data values 117 with the further array of input data values and by applying a respective position dependent kernel of the plurality of position dependent kernels 118 associated to the array of input data values 117 and a further position dependent kernel of a plurality of further position dependent kernels associated to the further array of input data values to a respective sub-array of the array of input data values 117 and a corresponding sub-array of the further array of input data values.
  • the correlation neural network layer 120 can be configured to generate the array of output data values out(x, y) 121 on the basis of the following equation:

    out(x, y) = 1/WL12(x, y) · Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j) · in1(x − i, y − j) · wL2(x, y, i, j) · in2(x − i, y − j),

    wherein in1(x − i, y − j) denotes the array of input data values 117, in2(x − i, y − j) denotes the further array of input data values, wL1(x, y, i, j) denotes the plurality of position dependent kernels 118 and wL2(x, y, i, j) denotes the plurality of further position dependent kernels associated to the further array of input data values.
  • the output data values 121 can be normalized using the following normalization factor:

    WL12(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j) · wL2(x, y, i, j).
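A sketch of the two-kernel guided correlation along the lines of the equations above; this is an assumption-laden illustration, not the embodiment itself: the function name is invented, and out-of-range taps are simply skipped, which is only one of several possible border policies.

```python
import numpy as np

def guided_correlation(in1, in2, w1, w2, r=1):
    """Correlate two feature maps under two sets of position dependent
    kernels, normalized by the sum of the products of the kernel values.

    in1, in2 -- 2D arrays of input data values
    w1, w2   -- position dependent kernels, shape (H, W, 2r+1, 2r+1)
    """
    H, W = in1.shape
    out = np.zeros((H, W))
    for x in range(H):
        for y in range(W):
            acc, norm = 0.0, 0.0
            for i in range(-r, r + 1):
                for j in range(-r, r + 1):
                    if 0 <= x - i < H and 0 <= y - j < W:  # skip border taps
                        ww = w1[x, y, i + r, j + r] * w2[x, y, i + r, j + r]
                        acc += ww * in1[x - i, y - j] * in2[x - i, y - j]
                        norm += ww
            out[x, y] = acc / norm if norm else 0.0
    return out
```

Setting all kernel values to one reduces this to a plain local correlation; position dependent values amplify or attenuate individual taps.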
  • the neural network layer 120 is configured to process the array of input data values 117 on the basis of the plurality of position dependent kernels 118 using a maximum or minimum pooling scheme. More specifically, in such an embodiment, the neural network layer 120 is configured to generate a respective output data value of the array of output data values 121 by determining the respective input data value of a respective sub-array of the plurality of sub-arrays of the array of input data values 117 which is associated with the maximum or minimum kernel value of the respective position dependent kernel of the plurality of position dependent kernels 118 and using the respective determined input data value as the respective output data value.
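The maximum variant of this pooling scheme can be sketched as follows; the function name, shapes and restriction to interior positions (to keep border handling out of the sketch) are assumptions for illustration.

```python
import numpy as np

def guided_max_pool(inp, wL, r=1):
    """For each output position, pick the input value at the position of
    the maximum kernel value of its position dependent kernel."""
    H, W = inp.shape
    out = np.zeros((H - 2 * r, W - 2 * r))
    for x in range(r, H - r):
        for y in range(r, W - r):
            k = wL[x, y]                                   # kernel at (x, y)
            i, j = np.unravel_index(np.argmax(k), k.shape)
            # kernel index (i, j) addresses input position
            # (x - (i - r), y - (j - r)), matching in(x - i, y - j)
            out[x - r, y - r] = inp[x - (i - r), y - (j - r)]
    return out
```

The minimum variant would replace `np.argmax` with `np.argmin`.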
  • the neural network layer 120 according to one of the embodiments described above can be used for stereo matching cost aggregation: the neural network 110 can use e.g. object features derived from semantic segmentation as the guiding data 113 in order to determine the object borders in the scene and guide the aggregation process of the input stereo matching cost, producing the aggregated stereo matching cost as an output.
  • Figure 10 shows a flow diagram illustrating a data processing method 1000 based on a neural network 110 according to an embodiment.
  • the data processing method 1000 can be performed by the data processing apparatus 100 shown in figure 1 and its different embodiments.
  • the data processing method 1000 comprises the step 1001 of generating, by the neural network layer 120 of the neural network 110, the array of output data values 121 from the array of input data values 117 based on the plurality of position dependent kernels 118 and the plurality of sub-arrays of the array of input data values 117.
  • Embodiments of the data processing methods may be implemented and/or performed by one or more processors as described above.
  • a first kernel is considered different to a second kernel if a kernel value of the array of kernel values of the first kernel at at least one position (or of at least one element) of the first kernel is different from the kernel value of the array of kernel values of the second kernel at the same position (or of the same element) of the kernel.
  • the different sub-arrays of the array of input values all have the same size and dimension. Accordingly, the different kernels typically have the same size and dimension.
  • a first sub-array of the array of input values is considered different to a second sub-array of the array of input values if the first sub-array of the array of input values comprises at least one element of the array of input values which is not comprised by the second sub-array of the array of input values.
  • the different sub-arrays of the array of input values differ at least by one column or row of elements of the array of input values.
  • the different sub-arrays may partially overlap or not overlap, as shown in Fig. 3.
  • Embodiments of the proposed guided aggregation can be applied for guided feature maps down-scaling.
  • the position dependent kernels derived from the guiding data serve as the additional input.
  • input values which are features of the feature map are grouped to form input data sub-arrays of the input data array and can be further aggregated in a controlled way, producing an output feature value representative of the whole sub-array.
  • the guiding data represents information about object or region borders, obtained by e.g. color-based segmentation, semantic segmentation using preceding neural network layers or an edge map of a texture image corresponding to the processed feature map.
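Such guided down-scaling could be sketched as follows for non-overlapping sub-arrays; the scale factor s, the function name and the per-position weight layout are assumptions for illustration, not taken from the embodiment.

```python
import numpy as np

def guided_downscale(feat, w, s=2):
    """Down-scale a feature map by factor s: each non-overlapping s x s
    sub-array is aggregated into one output value under its own weights.

    feat -- 2D feature map, shape (H, W), with H and W divisible by s
    w    -- aggregation weights, same shape as feat (hypothetical guidance)
    """
    H, W = feat.shape
    f = feat.reshape(H // s, s, W // s, s)
    g = w.reshape(H // s, s, W // s, s)
    # weighted average over each s x s block, normalized per block
    return (f * g).sum(axis=(1, 3)) / g.sum(axis=(1, 3))
```

Weights near zero on one side of an object border would keep those features from leaking into the down-scaled value.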
  • Embodiments of the proposed guided convolution can be applied for switchable feature extraction.
  • input values which are features of the feature map are convolved with adaptable feature extraction filters which are formed from the input guiding data in the form of position dependent kernels.
  • each selected area of the input feature map can be processed with feature extraction filters producing only the features desired for these regions.
  • the guiding data in the form of similarity features represents information about object or region borders, obtained by e.g. color-based segmentation, semantic segmentation using preceding neural network layers, an edge map of a texture image corresponding to the processed feature map or a ROI (region of interest) binary map.
  • Embodiments of the proposed guided correlation can be applied for guided correlation of input feature maps.
  • the position dependent kernels derived from the guiding data serve as the additional input.
  • input values which are features of the two or more feature maps are correlated together in a controlled way, enabling amplification or attenuation of some features within a correlation region. This way, features that correspond to other objects/regions in the feature map can be excluded or taken with smaller impact when computing the result. Also, some of the features characteristic for a selected region can be amplified.
  • the guiding data represents information about object or region borders, obtained by e.g. color-based segmentation, semantic segmentation using preceding neural network layers, an edge map of a texture image corresponding to the processed feature map or a ROI (region of interest) binary map.
  • normalization is advantageous if the output values obtained for different spatial positions are to be compared to each other per value, without any intermediate step; in that case, preservation of the mean (DC) component is desirable. If such a comparison is not performed, normalization is not required and only increases complexity. Additionally, normalization can be omitted in order to simplify the computations and compute only an approximate result.


Abstract

The invention relates to a data processing apparatus (100) comprising a processor (101) configured to provide a neural network (110), wherein the neural network (110) comprises a neural network layer (120) being configured to generate from an array of input data values (117) an array of output data values (121) based on a plurality of position dependent kernels (118) and a plurality of sub-arrays of the array of input data values (117). Moreover, the invention relates to a corresponding data processing method.

Description

DESCRIPTION
Neural network data processing apparatus and method

TECHNICAL FIELD
Generally, the present invention relates to the field of machine learning or deep learning based on neural networks. More specifically, the present invention relates to a neural network data processing apparatus and method, in particular for processing data in the fields of audio processing, computer vision, image or video processing, classification, detection and/or recognition.
BACKGROUND
Weighted aggregation, which is commonly used in many signal processing applications, such as image processing methods for image quality improvement, depth or disparity estimation and many other applications [Kaiming He, Jian Sun, Xiaoou Tang, "Guided Image Filtering", ECCV 2010], is a process in which input data is combined to pack the information present in a larger spatial area into one single spatial position, with additional input in the form of aggregation weights that control the influence of each input data value on the result.
In deep learning, a common approach recently used in many application fields is the utilization of convolutional neural networks. Generally, a specific part of such convolutional neural networks is at least one convolution (or convolutional) layer which performs a convolution of input data values with a learned kernel K, producing one output data value per convolution kernel for each output position [J. Long, E. Shelhamer, T. Darrell, "Fully Convolutional Networks for Semantic Segmentation", CVPR 2015]. For the two-dimensional case used, for instance, in image processing, the convolution using the learned kernel K can be expressed mathematically as follows:

out(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} K(i, j) · in(x − i, y − j) + B,

wherein out(x, y) denotes the array of output data values, in(x − i, y − j) denotes a sub-array of an array of input data values and K(i, j) denotes the kernel comprising an array of kernel weights or kernel values of size (2r+1)×(2r+1). B denotes an optional learned bias term, which can be added for obtaining each output data value. The weights of the kernel K are the same for the whole array of input data values in(x, y) and are generally learned during a learning phase of the neural network which, in the case of 1st-order methods, consists of iteratively back-propagating the gradients of the neural network output back to the input layers and updating the weights of all the network layers by the partial derivatives computed in this way.
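As a point of reference, the position independent convolution of the equation above can be sketched as below; this is a minimal valid-region implementation with an invented function name, not production convolution code.

```python
import numpy as np

def conv2d_single_kernel(inp, K, B=0.0):
    """Plain convolution with one learned, position independent kernel K
    of size (2r+1) x (2r+1); valid region only (no padding)."""
    r = K.shape[0] // 2
    H, W = inp.shape
    out = np.zeros((H - 2 * r, W - 2 * r))
    for x in range(r, H - r):
        for y in range(r, W - r):
            # out(x, y) = sum_ij K(i, j) * in(x - i, y - j) + B
            out[x - r, y - r] = B + sum(
                K[i + r, j + r] * inp[x - i, y - j]
                for i in range(-r, r + 1) for j in range(-r, r + 1))
    return out
```

The same kernel K is applied at every position; the invention described below replaces K with a kernel that varies per position.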
SUMMARY
It is an object of the invention to provide an improved data processing apparatus and method based on neural networks.
The foregoing and other objects are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
Generally, embodiments of the invention provide a new approach for weighted aggregation of data for neural networks that is implemented into a neural network as a new type of neural network layer. The neural network layer can compute aggregated data using individual aggregation weights that are learned for each individual spatial position.
Aggregation weights can be computed as a function of similarity features and learned weight kernels, resulting in individual aggregation weights for each output spatial position.
In this way a variety of sophisticated position dependent or position adaptive kernels learned by the neural network can be utilized for better adaptation of the aggregation weights to input data.
More specifically, according to a first aspect the invention relates to a data processing apparatus comprising one or more processors configured to provide a neural network. The data to be processed by the data processing apparatus can be, for instance, two-dimensional image or video data or one-dimensional audio data.
The neural network provided by the one or more processors of the data processing apparatus comprises a neural network layer being configured to process an array of input data values, such as a two-dimensional array of input data values in(x, y), into an array of output data values, such as a two-dimensional array of output data values out(x, y). The neural network layer can be a first layer or an intermediate layer of the neural network. The array of input data values can be one-dimensional (i.e. a vector, e.g. audio or other e.g. temporal sequence), two-dimensional (i.e. a matrix, e.g. an image or other temporal or spatial sequence), or N-dimensional (e.g. any kind of N-dimensional feature array, e.g. provided by a conventional pre-processing or feature extraction and/or by other layers of the neural network).
The array of input data values can have one or more channels, e.g. for an RGB image one R-channel, one G-channel and one B-channel, or for a black/white image only one grey-scale or intensity channel. The term "channel" can refer to any "feature", e.g. features obtained from conventional pre-processing or feature extraction or from other neural networks or neural network layers of the same neural network. The array of input data values can comprise, for instance, two-dimensional RGB or grey scale image or video data representing at least a part of an image, or a one-dimensional audio signal. In case the neural network layer is implemented as an intermediate layer of the neural network, the array of input data values can be, for instance, any kind of array of features generated by previous layers of the neural network on the basis of an initial, e.g. original, array of input data values, e.g. by means of a feature extraction.
The neural network layer is configured to generate from the array of input data values the array of output data values on the basis of a plurality of position dependent, i.e. spatially variable kernels and a plurality of different sub-arrays at different positions of the array of input data values. Each kernel comprises a plurality of kernel values or kernel weights. A respective kernel is applied to a respective sub-array of the array of input data values to generate a single output data value.
A "position dependent kernel" as used herein means a kernel whose kernel weights depend on the respective position, e.g. (x,y) for two-dimensional arrays, of the sub-array of input data values. In other words, for a first kernel the kernel values applied to a first sub-array of the array of input data values can differ from the kernel values of a second kernel applied to a second sub-array of the array of input data values. In a two- dimensional array the position could be a spatial position defined, for instance, by two spatial coordinates. In a one-dimensional array the position could be a temporal position defined, for instance, by a time coordinate. Thus, an improved data processing apparatus based on neural networks is provided. The data processing apparatus allows to aggregate the input data in a way that can better reflect mutual data similarity, i.e. the resultant output data value is more strongly influenced by input data values that are closer and more similar to input data in the center position of the kernel. Moreover, the data processing apparatus allows adapting the kernel weights for different spatial positions of the array of input data values. This, in turn, allows, for instance, minimizing the influence of some of the input data values on the result, for instance the input data values that are associated with another part of the scene (as determined by semantic segmentation) or a different object that is being analysed.
In a further implementation form of the first aspect, the neural network comprises at least one additional network layer configured to generate the plurality of position dependent kernels on the basis of an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values. The original array of original input values can be the array of input data values or a different array.
In a further implementation form of the first aspect, the neural network is configured to generate the plurality of position dependent kernels based on a plurality of learned position independent kernels and a plurality of position dependent weights. Generally, the position independent kernels can be learned by the neural network and the position dependent weights or similarity features can be computed, for instance, by a further preceding layer of the neural network. This implementation form allows minimizing the amount of data being transferred to the neural network layer in order to obtain the kernel values. This is because the kernel values are not transferred directly, but computed from the plurality of position dependent weights and/or similarity features, substantially reducing the amount of data for each element of the array of output data values. This can minimize the amount of data being stored and transferred by the neural network between the different network layers, which is especially important during the learning process on the basis of the mini-batch approach, as the memory of the data processing apparatus (GPU) is currently the main bottleneck.
In a further implementation form of the first aspect, the neural network is configured to generate a kernel of the plurality of position dependent kernels by adding the learned position independent kernels each weighted by the associated non-learned position dependent weights (i.e. similarity features). This implementation form provides a very efficient representation of the plurality of position dependent kernels using a linear combination of position independent "base kernels".
In a further implementation form of the first aspect, the plurality of position independent kernels are predetermined or learned, and the neural network comprises at least one additional neural network layer or "conventional" pre-processing layer configured to generate the plurality of position dependent weights (i.e. similarity features) based on an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values. The original array of original input values can be the array of input data values or a different array. In an implementation form the at least one additional neural network layer or "conventional" pre-processing layer can generate the plurality of position dependent weights (i.e. similarity features) using, for instance, bilateral filtering, semantic segmentation, per-instance object detection, and data importance indicators like a ROI (region of interest).
In a further implementation form of the first aspect, the array of input data values and the array of output data values are two-dimensional arrays, and the convolutional neural network layer is configured to generate the plurality of position dependent kernels wL(x, y, i, j) on the basis of the following equation:

wL(x, y, i, j) = Σ_{f=1..Nf} Ff(x, y) · Kf(i, j),

wherein Ff(x, y) denotes the set of Nf position dependent weights (i.e. similarity features) and Kf denotes the plurality of position independent kernels.
In a further implementation form of the first aspect, the neural network layer is a convolutional network layer or an aggregation network layer.
In a further implementation form of the first aspect, the array of input data values and the array of output data values are two-dimensional arrays, wherein the array of input data values in(x, y, ci) has Ci different channels and wherein the neural network layer is a convolutional network layer configured to generate the array of output data values out(x, y, co) on the basis of the following equations:

out(x, y, co) = 1/WL(x, y, co) · Σ_{ci=1..Ci} Σ_{i=-r..r} Σ_{j=-r..r} wL(x, y, i, j, ci, co) · in(x − i, y − j, ci),

WL(x, y, co) = Σ_{ci=1..Ci} Σ_{i=-r..r} Σ_{j=-r..r} wL(x, y, i, j, ci, co),

wherein r denotes the size of each kernel of the plurality of position dependent kernels wL(x, y, i, j, ci, co) and WL(x, y, co) denotes a normalization factor. In an implementation form the normalization factor WL(x, y, co) can be set equal to 1.
In a further implementation form of the first aspect, the array of input data values and the array of output data values are two-dimensional arrays, wherein the array of input data values in(x, y) has only a single channel and wherein the neural network layer is an aggregation network layer configured to generate the array of output data values out(x, y) on the basis of the following equations:

out(x, y) = 1/WL(x, y) · Σ_{i=-r..r} Σ_{j=-r..r} wL(x, y, i, j) · in(x − i, y − j),

WL(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} wL(x, y, i, j),

wherein r denotes the size of each kernel of the plurality of position dependent kernels wL(x, y, i, j) and WL(x, y) denotes a normalization factor. In an implementation form the normalization factor WL(x, y) can be set equal to 1.
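A sketch of such a single-channel aggregation layer; the valid-region restriction, the function name and the kernel-flip detail (which accounts for the in(x − i, y − j) indexing) are implementation assumptions for illustration.

```python
import numpy as np

def guided_aggregation(inp, wL, r=1):
    """Single-channel aggregation layer: each output value is a weighted
    average of a (2r+1) x (2r+1) input sub-array under its own kernel."""
    H, W = inp.shape
    out = np.zeros((H - 2 * r, W - 2 * r))
    for x in range(r, H - r):
        for y in range(r, W - r):
            sub = inp[x - r:x + r + 1, y - r:y + r + 1]
            k = wL[x, y, ::-1, ::-1]   # flip so k matches in(x - i, y - j)
            out[x - r, y - r] = (k * sub).sum() / k.sum()  # W_L(x, y) norm
    return out
```

With all-ones kernels this degenerates to a box average; position dependent kernels turn it into a spatially adaptive weighted average.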
In a further implementation form of the first aspect, the neural network layer is a correlation network layer configured to generate the array of output data values from the array of input data values and a further array of input data values by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels, or by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels associated to the array of input data values and a further position dependent kernel of a plurality of further position dependent kernels associated to the further array of input data values. In a further implementation form of the first aspect, the array of input data values in1(x, y), the further array of input data values in2(x, y) and the plurality of position dependent kernels wL1(x, y, i, j) are two-dimensional arrays and wherein the correlation neural network layer is configured to generate the array of output data values out(x, y) on the basis of the following equations:

out(x, y) = 1/WL1(x, y) · Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j) · in1(x − i, y − j) · in2(x − i, y − j),

WL1(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j),

wherein r denotes the size of each kernel of the plurality of position dependent kernels wL1(x, y, i, j) and WL1(x, y) denotes a normalization factor. In an implementation form the normalization factor WL1(x, y) can be set equal to 1.
In a further implementation form of the first aspect, the array of input data values in1(x, y), the further array of input data values in2(x, y), the plurality of position dependent kernels wL1(x, y, i, j) and the plurality of further position dependent kernels wL2(x, y, i, j) are two-dimensional arrays and wherein the correlation neural network layer is configured to generate the array of output data values out(x, y) on the basis of the following equations:

out(x, y) = 1/WL12(x, y) · Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j) · in1(x − i, y − j) · wL2(x, y, i, j) · in2(x − i, y − j),

WL12(x, y) = Σ_{i=-r..r} Σ_{j=-r..r} wL1(x, y, i, j) · wL2(x, y, i, j),

wherein r denotes the size of each kernel of the plurality of position dependent kernels wL1(x, y, i, j) and of each kernel of the plurality of further position dependent kernels wL2(x, y, i, j) and WL12(x, y) denotes a normalization factor. In an implementation form the normalization factor WL12(x, y) can be set equal to 1.
In a further implementation form of the first aspect, the neural network layer is configured to generate a respective output data value of the array of output data values by determining a respective input data value of a respective sub-array of input data values of the plurality of sub-arrays of input data values being associated with a maximum or minimum kernel value of a position dependent kernel and using the respective determined input data value as the respective output data value.
According to a second aspect the invention relates to a corresponding data processing method comprising the step of generating by a neural network layer of a neural network from an array of input data values an array of output data values based on a plurality of position dependent kernels and a plurality of sub-arrays of the array of input values.
In a further implementation form of the second aspect, the method comprises the further step of generating the position dependent kernel of the plurality of position dependent kernels by an additional neural network layer of the neural network based on an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values.
In a further implementation form of the second aspect, the step of generating the position dependent kernel of the plurality of position dependent kernels comprises generating the position dependent kernel of the plurality of position dependent kernels based on a plurality of position independent kernels and a plurality of position dependent weights.
In a further implementation form of the second aspect, the step of generating a kernel of the plurality of position dependent kernels comprises summing the position independent kernels, each weighted by its associated position dependent weight.
In a further implementation form of the second aspect, the plurality of position independent kernels are predetermined or learned and the step of generating the plurality of position dependent weights comprises generating the plurality of position dependent weights by an additional neural network layer or a processing layer of the neural network based on an original array of original input values of the neural network, wherein the original array of original input values of the neural network comprises the array of input values or another array of input values associated to the array of input values.

In a further implementation form of the second aspect, the array of input data values and the array of output data values are two-dimensional arrays, and the step of generating a kernel wL(x, y, i, j) of the plurality of position dependent kernels is based on the following equation:

$$w_L(x,y,i,j) = \sum_{f=1}^{N_f} F_f(x,y)\, K_f(i,j)$$
wherein Ff(x, y) denotes the plurality of Nf position dependent weights (i.e. similarity features) and Kf(i, j) denotes the plurality of position independent kernels.
In a further implementation form of the second aspect, the neural network layer is a convolutional network layer or an aggregation network layer.
In a further implementation form of the second aspect, the array of input data values and the array of output data values are two-dimensional arrays and the neural network layer is a convolutional network layer, wherein the step of generating the array of output data values is based on the following equation:

$$out(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,i,j)\, in(x-i,y-j)$$

or, wherein the neural network layer is an aggregation network layer and wherein the step of generating the array of output data values is based on the following equations:

$$out(x,y) = \frac{1}{W_L(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,i,j)\, in(x-i,y-j), \qquad W_L(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,i,j)$$

In implementation forms, the normalization factor WL(x, y) can be set equal to 1.

In a further implementation form of the second aspect, the neural network layer is a correlation network layer and the step of generating the array of output data values comprises generating the array of output data values from the array of input data values and a further array of input data values (a) by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels, or (b) by correlating the array of input data values with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels associated to the array of input data values and a further position dependent kernel of a plurality of further position dependent kernels associated to the further array of input data values.
In a further implementation form of the second aspect, the array of input data values, the further array of input data values and the position dependent kernels are two-dimensional arrays and the step of generating the array of output data values by the correlation network layer is based on the following equations:

$$out(x,y) = \frac{1}{W_L(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x,y,i,j)\, in1(x-i,y-j)\, in2(x-i,y-j)$$

or

$$out(x,y) = \frac{1}{W_{L12}(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x,y,i,j)\, in1(x-i,y-j)\, w_{L2}(x,y,i,j)\, in2(x-i,y-j)$$

In any of the above implementation forms, the normalization factors WL or WL12 can be set equal to 1.
In a further implementation form of the second aspect, the step of generating an output data value of the array of output data values by the neural network layer comprises the steps of determining an input data value of a sub-array of input data values of the plurality of sub-arrays of input data values being associated with a maximum or minimum kernel value of a position dependent kernel and using the determined input data value as the output data value.
According to a third aspect the invention relates to a computer program comprising program code for performing the method according to the second aspect, when executed on a processor or a computer.
The invention can be implemented in hardware and/or software or in any combination thereof.
BRIEF DESCRIPTION OF THE DRAWINGS
Further embodiments of the invention will be described with respect to the following figures, wherein:
Fig. 1 shows a schematic diagram illustrating a data processing apparatus based on a neural network according to an embodiment;
Fig. 2 shows a schematic diagram illustrating a neural network provided by a data processing apparatus according to an embodiment;
Fig. 3 shows a schematic diagram illustrating the concept of down-stepping or aggregation of data implemented in a data processing apparatus according to an embodiment;
Fig. 4 shows a schematic diagram illustrating different aspects of a neural network provided by a data processing apparatus according to an embodiment;
Fig. 5 shows a schematic diagram illustrating different aspects of a neural network provided by a data processing apparatus according to an embodiment;
Fig. 6 shows a schematic diagram illustrating different processing steps of a data processing apparatus according to an embodiment;

Fig. 7 shows a schematic diagram illustrating a neural network provided by a data processing apparatus according to an embodiment;
Fig. 8 shows a schematic diagram illustrating different aspects of a neural network provided by a data processing apparatus according to an embodiment;
Fig. 9 shows a schematic diagram illustrating different processing steps of a data processing apparatus according to an embodiment; and

Fig. 10 shows a flow diagram illustrating a neural network data processing method according to an embodiment.
In the various figures, identical reference signs will be used for identical or at least functionally equivalent features.
DETAILED DESCRIPTION OF EMBODIMENTS
In the following description, reference is made to the accompanying drawings, which form part of the disclosure, and in which are shown, by way of illustration, specific aspects in which the present invention may be placed. It is understood that other aspects may be utilized and structural or logical changes may be made without departing from the scope of the present invention. The following detailed description, therefore, is not to be taken in a limiting sense, as the scope of the present invention is defined by the appended claims. For instance, it is understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if a specific method step is described, a corresponding device may include a unit to perform the described method step, even if such unit is not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.
Figure 1 shows a schematic diagram illustrating a data processing apparatus 100 according to an embodiment configured to process data on the basis of a neural network. To this end, the data processing apparatus 100 shown in figure 1 comprises a processor 101. In an embodiment, the data processing apparatus 100 can be implemented as a distributed data processing apparatus 100 comprising more than the one processor 101 shown in figure 1. The processor 101 of the data processing apparatus 100 is configured to provide a neural network 110. As will be described in more detail further below, the neural network 110 comprises a neural network layer being configured to generate from an array of input data values an array of output data values based on a plurality of sub-arrays of the array of input data values and a plurality of position dependent kernels comprising a plurality of kernel values or kernel weights. As shown in figure 1, the data processing apparatus 100 can further comprise a memory 103 for storing and/or retrieving the input data values, the output data values and/or the kernel values.
Each position dependent kernel comprises a plurality of kernel values or kernel weights. For a respective position or element of the array of input data values a respective kernel is applied to a respective sub-array of the array of input data values to generate a single output data value. A "position dependent kernel" as used herein means a kernel whose kernel weights depend on the respective position of the sub-array of the array of input data values to which the kernel is applied. In other words, for a first kernel applied to a first sub-array of the plurality of input data values the kernel values can differ from the kernel values of a second kernel applied to a second sub-array of the plurality of input data values forming a different sub-array of the same array of input values.
In a two-dimensional array the position could be a spatial position defined, for instance, by two spatial coordinates (x,y). In a one-dimensional array the position could be a temporal position defined, for instance, by a time coordinate (t).
The array of input data values can be one-dimensional (i.e. a vector, e.g. audio or another temporal sequence), two-dimensional (i.e. a matrix, e.g. an image or another temporal or spatial sequence), or N-dimensional (e.g. any kind of N-dimensional feature array, e.g. provided by a conventional pre-processing or feature extraction and/or by other layers of the neural network 110). The array of input data values can have one or more channels, e.g. for an RGB image one R-channel, one G-channel and one B-channel, or for a black/white image only one grey-scale or intensity channel. The term "channel" can refer to any "feature", e.g. features obtained from conventional pre-processing or feature extraction or from other neural networks or neural network layers of the neural network 110. The array of input data values can comprise, for instance, two-dimensional RGB or grey scale image or video data representing at least a part of an image, or a one-dimensional audio signal. In case the neural network layer 120 is implemented as an intermediate layer of the neural network 110, the array of input data values can be, for instance, an array of similarity features generated by previous layers of the neural network on the basis of an initial, e.g. original, array of input data values, e.g. by means of a feature extraction, as will be described in more detail further below.

As will be described in more detail below, the neural network layer 120 can be implemented as an aggregation layer 120 configured to process each channel of the array of input data values separately, e.g. for a sub-array of the input array of R-values one (scalar) R-output value is generated. The position dependent kernels may be channel-specific or common for all channels. Moreover, the neural network layer 120 can be implemented as a convolution (or convolutional) layer configured to "mix" all channels of the array of input data values. For instance, in case the array of input data values is an RGB image, i.e. a multi-channel array, based on the three corresponding sub-arrays of the three input arrays (R, G and B) only one (scalar) output value is generated for the three channels (R, G and B) of the multi-channel array of input data values. The position dependent kernels may be channel-specific or common for all channels. In the case of a convolution layer 120 the position dependent kernels are generally multi-channel kernels. Furthermore, the neural network layer can be implemented as a correlation layer providing a combination of aggregation or convolution (input image and weighted kernel) and an additional image, i.e. a correlation of two sub-arrays in the two images with each other (e.g. of the same image or at the same position) and an additional application of the position dependent kernel on the correlation result. Also in this case the position dependent kernels may be channel-specific or common for all channels.
Figure 2 shows a schematic diagram illustrating elements of the neural network 110 provided by the data processing apparatus 100 according to an embodiment. In the embodiment shown in figure 2, the neural network layer 120 is implemented as a weighted aggregation layer 120. In a further embodiment, the neural network layer 120 can be implemented as a convolution network layer 120 (also referred to as convolutional network layer 120) or as a correlation network layer 120, as will be described in more detail further below. As indicated in figure 2, in this embodiment the aggregation layer 120 is configured to generate a two-dimensional array of output data values out(x, y) 121 on the basis of a respective sub-array of the two-dimensional array of input data values in(x, y) 117 and a plurality of position dependent kernels 118 comprising a plurality of kernel values or kernel weights.
In an embodiment, the weighted aggregation layer 120 of the neural network 110 shown in figure 2 is configured to generate the array of output data values out(x, y) 121 on the basis of the plurality of sub-arrays of the two-dimensional array of input data values in(x, y) 117 and the plurality of position dependent kernels 118 comprising the kernel values wL(x, y, i, j) using the following equation:

$$out(x,y) = \frac{1}{W_L(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,i,j)\, in(x-i,y-j)$$

wherein r denotes a size of each kernel of the plurality of position dependent kernels 118 (in this example, each kernel and each sub-array of the array of input values has (2r+1)×(2r+1) kernel values respectively input values) and the output data values can be normalized using the following normalization factor:

$$W_L(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,i,j)$$
In other embodiments, the normalization factor can be omitted, i.e. set to one. For instance, in case the neural network layer 120 is implemented as a convolutional network layer the normalization factor can be omitted. For weighted aggregation the normalization factor allows to keep the mean value or DC component. This can be advantageous when the weighted aggregation layer 120 is used to aggregate stereo matching costs of a stereo image, because the normalization is beneficial for making the output values for different sub-arrays of the array of input data values comparable. This is usually not necessary in the case of the convolutional network layer 120. As will be appreciated, the above equations for a two-dimensional input array and a kernel having a quadratic shape can be easily adapted to the case of an array of input values 117 having one dimension or more than two dimensions and/or a kernel having a rectangular shape, e.g. a non-square rectangular shape with different horizontal and vertical dimensions.

For an embodiment where the neural network layer 120 is implemented as a convolution network layer and the array of input data values in(x, y, c_i) 117 is a two-dimensional array of input data values having more than one channel, such as in the case of RGB image data, the neural network layer 120 can be configured to generate the array of output data values out(x, y, c_o) 121 having one or more channels on the basis of the plurality of sub-arrays of the two-dimensional array of input data values in(x, y, c_i) 117 in the different channels and the plurality of position dependent kernels 118 comprising the kernel values wL(x, y, c_o, c_i, i, j) using the following equation:

$$out(x,y,c_o) = \sum_{c_i=1}^{C_i} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,c_o,c_i,i,j)\, in(x-i,y-j,c_i)$$

wherein C_i denotes the number of channels of the array of input data values in(x, y, c_i) 117 and the output data values can be normalized using the following normalization factor:

$$W_L(x,y,c_o) = \sum_{c_i=1}^{C_i} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_L(x,y,c_o,c_i,i,j)$$
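By way of illustration, the position dependent weighted aggregation described above can be sketched in a few lines of NumPy. The function name, array layout and edge padding are assumptions made for this sketch and not part of the described apparatus; the multi-channel convolution variant would additionally sum over input channels.

```python
import numpy as np

def aggregate(inp, kernels, normalize=True):
    """Weighted aggregation with one kernel per spatial position (sketch).

    inp     : (H, W) array of input data values in(x, y)
    kernels : (H, W, 2r+1, 2r+1) array holding w_L(x, y, i, j) per position
    """
    H, W, kh, kw = kernels.shape
    r = kh // 2
    padded = np.pad(inp, r, mode="edge")          # handle borders by replication
    out = np.empty((H, W))
    for y in range(H):
        for x in range(W):
            patch = padded[y:y + kh, x:x + kw]    # sub-array centred on (x, y)
            w = kernels[y, x]                     # the position dependent kernel
            s = w.sum() if normalize else 1.0     # normalization factor W_L(x, y)
            out[y, x] = (w * patch).sum() / s
    return out
```

With uniform kernel values and normalization enabled, each output value reduces to the mean of its sub-array, which is a convenient sanity check.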
In an embodiment, the neural network layer 120 is configured to generate the array of output data values 121 with a smaller size than the array of input data values 117. In other words, in an embodiment, the neural network 110 is configured to perform a down-step operation on the basis of the plurality of position dependent kernels 118. Figure 3 illustrates a down-step operation provided by the neural network 110 of the data processing apparatus 100 according to an embodiment. Using a down-step operation allows increasing the receptive field, enables processing the data with a cascade of smaller filters as compared with a single layer with a kernel covering an equal receptive field, and also enables the neural network 110 to better analyze the data by finding more sophisticated relationships among the data and adding more non-linearities to the processing chain by separating each convolution layer with a non-linear element like a sigmoid or a Rectified Linear Unit (ReLU).

In the down-step operation illustrated in figure 3 the neural network layer 120 can combine the input data values to produce the array of output data values with a reduced resolution. This can be achieved by convolving the array of input data values 117 with the position dependent kernels 118 with a stride S greater than 1. The stride S specifies the spacing between neighboring input spatial positions for which convolutions are computed. If the stride S is equal to 1, the convolution is performed for each spatial position. If the stride S is greater than 1, the neural network layer 120 is configured to perform a convolution for every S-th spatial position of the array of input data values 117, thereby reducing the output resolution (i.e. the dimensions of the array of output data values 121) by a factor of S for each spatial dimension. The horizontal and the vertical stride may be the same or different. In the exemplary embodiment shown in figure 3, the neural network layer 120 combines the array of input data values 117 from the spatial area of size (2r+1)×(2r+1) to produce a respective output data value of the array of output data values 121. In this way, the input data values 117 can be aggregated to pack information present in a larger spatial area into one single spatial position.
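The strided down-step can be sketched as a small variant of the aggregation: the kernels are evaluated only at every S-th input position, so the output resolution shrinks by the stride factor. The array layout and names below are illustrative assumptions, not the described apparatus.

```python
import numpy as np

def aggregate_strided(inp, kernels, stride=2):
    """Down-step: apply one position dependent kernel per *output* position,
    sampling the input at every S-th spatial position (sketch).

    inp     : (H, W) input array
    kernels : (Ho, Wo, 2r+1, 2r+1) kernels, one per output position
    """
    Ho, Wo, kh, kw = kernels.shape
    r = kh // 2
    padded = np.pad(inp, r, mode="edge")
    out = np.empty((Ho, Wo))
    for yo in range(Ho):
        for xo in range(Wo):
            y, x = yo * stride, xo * stride       # stride S input spacing
            patch = padded[y:y + kh, x:x + kw]
            w = kernels[yo, xo]
            out[yo, xo] = (w * patch).sum() / w.sum()
    return out
```

For a 4×4 input and stride 2 this produces a 2×2 output, i.e. the resolution is reduced by a factor of S in each spatial dimension.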
In the embodiment shown in figure 2, the neural network 110 comprises one or more preceding layers 115 preceding the neural network layer 120 and one or more following layers 125 following the neural network layer 120. In an embodiment, the neural network layer 120 could be the first and/or the last data processing layer of the neural network 110, i.e. in an embodiment there could be no preceding layers 115 and/or no following layers 125.
In an embodiment, the one or more preceding layers 115 can be further neural network layers and/or "conventional" pre-processing layers, such as a feature extraction layer. Likewise, in an embodiment, the one or more following layers 125 can be further neural network layers, such as a deconvolutional layer, and/or "conventional" post-processing layers.
As shown in the embodiment of figure 2, one or more of the preceding layers 115 can be configured to provide, i.e. to generate, the plurality of position dependent kernels 118 (see the bottom signal path of the preceding layers 115 from the guiding data 113 to the position dependent kernels 118 in Fig. 2). In an embodiment, the one or more layers of the preceding layers 115 can generate the plurality of position dependent kernels 118 on the basis of an original array 111 of original input data values, e.g. an original image in the 2D example. As indicated in figure 2, in an embodiment, the original array 111 of original input data values can be an array of input data 111 being the original input of the neural network 110. In another embodiment, the one or more preceding layers 115 could be configured to generate just the plurality of position dependent kernels 118 on the basis of the original input data 111 of the neural network 110 and to provide the original input data 111 of the neural network 110 as the array of input data values 117 to the neural network layer 120 (no preceding layers in the top signal path from the original input array 111 to the input array 117, see Fig. 2). In other words, according to an embodiment, the original array 111 may form the input array 117.
As indicated in figure 2, in a further embodiment the one or more preceding layers 115 of the neural network 110 are configured to generate the plurality of position dependent kernels 118 on the basis of an array of guiding data 113. A more detailed view of the processing steps of the neural network 110 of the data processing apparatus 100 according to such an embodiment is shown in figure 4 for the exemplary case of two-dimensional input and output arrays. The array of guiding data 113 is used by the one or more preceding layers 115 of the neural network 110 to generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113. As already described in the context of figure 2, the neural network layer 120 is configured to generate the two-dimensional array of output data values out(x, y) 121 on the basis of the two-dimensional array of input data values in(x, y) 117 and the plurality of position dependent kernels wL(x, y) 118, which, in turn, are based on the array of guiding data g(x, y) 113. In an embodiment, the one or more preceding layers 115 of the neural network 110 are neural network layers configured to learn the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113. In another embodiment, the one or more preceding layers 115 of the neural network 110 are pre-processing layers configured to generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data 113 using one or more pre-processing schemes, such as feature extraction.
In an embodiment, the one or more preceding layers 115 of the neural network 110 are configured to generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113 in a way analogous to bilateral filtering, as illustrated in figure 5. Bilateral filtering is known in the field of image processing for performing a weighted aggregation of input data, while decreasing the influence of some input values and amplifying the influence of other input values on the aggregation result [M. Elad, "On the origin of bilateral filter and ways to improve it", IEEE Transactions on Image Processing, vol. 11, no. 10, pp. 1141-1151, October 2002]. As illustrated in figure 5, the weights 518 utilized for aggregating the array of input data values 517 adapt to the input data 517 using the guiding image data g 513, which provides additional information to control the aggregation process. In an embodiment, the array of guiding image data 513 can be equal to the array of input data values for generating the array of output data values 521 by the layer 520 on the basis of the weights 518. The bilateral filter weights 518 take into consideration the distance of the value within the kernel from the center of the kernel and, additionally, the similarity of the data values with the data in the center of the kernel, as mathematically described by the following equation:

$$out(x,y) = \frac{1}{W(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w(x,y,i,j)\, in(x-i,y-j)$$

wherein the normalization factor is based on the following equation:

$$W(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w(x,y,i,j)$$

In an embodiment, the bilateral filter weights 518 are defined by the following equation:

$$w(x,y,i,j) = \exp\!\left(-\frac{i^2+j^2}{2\sigma_d^2}\right) \exp\!\left(-\frac{d\big(g(x,y),\, g(x-i,y-j)\big)^2}{2\sigma_r^2}\right)$$

wherein d(·, ·) denotes a distance function.
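The bilateral weighting above can be sketched as follows; the squared-difference distance, the function name and the padding choice are assumptions made for this illustration, standing in for the generic distance function d(·, ·).

```python
import numpy as np

def bilateral_weights(guide, r, sigma_d, sigma_r):
    """Build position dependent kernels in the style of bilateral filter
    weights (sketch): a spatial distance term times a range similarity term.

    guide : (H, W) guiding image g(x, y)
    Returns an (H, W, 2r+1, 2r+1) array of weights w(x, y, i, j).
    """
    H, W = guide.shape
    ii, jj = np.mgrid[-r:r + 1, -r:r + 1]
    spatial = np.exp(-(ii**2 + jj**2) / (2 * sigma_d**2))   # distance from centre
    padded = np.pad(guide, r, mode="edge")
    w = np.empty((H, W, 2 * r + 1, 2 * r + 1))
    for y in range(H):
        for x in range(W):
            patch = padded[y:y + 2 * r + 1, x:x + 2 * r + 1]
            diff = patch - guide[y, x]                      # similarity to centre value
            w[y, x] = spatial * np.exp(-diff**2 / (2 * sigma_r**2))
    return w
```

On a flat guiding image the range term is 1 everywhere, so each kernel collapses to the pure spatial Gaussian, with a centre weight of exactly 1.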
Figure 6 shows a schematic diagram highlighting the main processing stage 601 of the data processing apparatus 100 according to an embodiment, for instance, the data processing apparatus 100 providing the neural network 110 shown in figure 2. As already described above, in a first processing step 603 the neural network 110 can generate the plurality of position dependent kernels wL(x, y) 118 on the basis of the array of guiding data g(x, y) 113. In a second processing step 605 the neural network 110 can generate the array of output data values out(x, y) 121 on the basis of the array of input data values in(x, y) 117 and the plurality of position dependent kernels wL(x, y, i, j) 118.

Figure 7 shows a schematic diagram illustrating the neural network 110 provided by the data processing apparatus 100 according to a further embodiment. As will be described in more detail in the following, the main difference to the embodiment shown in figure 2 is that in the embodiment shown in figure 7 the neural network 110 is configured to generate the plurality of position dependent kernels based on a plurality of position independent kernels 119b (shown in figure 8) and a plurality of position dependent weights Ff(x, y) 119a (also referred to as similarity features 119a). In an embodiment, the similarity features 119a are obtained based on the guiding data 113 and could indicate higher-level knowledge about the input data 111, including e.g. semantic segmentation, per-instance object detection, data importance indicators like ROI (Region of Interest) and many others, all learned by the neural network 110 itself or being an additional input to the neural network 110. In an embodiment, the neural network 110 of figure 7 is configured to generate the plurality of position dependent kernels 118 by adding the position independent kernels 119b, each weighted by the associated position dependent weight Ff(x, y) 119a.
In an embodiment, the plurality of position independent kernels 119b can be predetermined or learned by the neural network 110. As illustrated in figure 7, also in this embodiment the neural network 110 can comprise one or more preceding layers 115, which precede the neural network layer 120 and which can be implemented as an additional neural network layer or a pre-processing layer. In an embodiment, one or more layers of the preceding layers 115 are configured to generate the plurality of position dependent weights Ff(x, y) 119a on the basis of an original array of original input data values or the guiding data 113. The original array of original input data values of the neural network 110 can comprise the array of input data values 117 to be processed by the neural network layer 120 or another array of input data values 111 associated to the array of input data values 117, for instance, the initial or original array of input data 111. In the exemplary embodiment shown in figure 7, the array of input data values in(x, y) 117 and the array of output data values out(x, y) 121 are two-dimensional arrays and the neural network layer 120 is configured to generate a respective kernel of the plurality of position dependent kernels wL(x, y, i, j) 118 on the basis of the following equation:

$$w_L(x,y,i,j) = \sum_{f=1}^{N_f} F_f(x,y)\, K_f(i,j)$$

wherein Ff(x, y) denotes the set of Nf position dependent weights (or similarity features) 119a and Kf denotes the plurality of position independent kernels 119b, as also illustrated in figure 8.
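The sum over weighted position independent kernels is a single tensor contraction, which the following NumPy sketch makes explicit; the array layout is an assumption chosen for the example.

```python
import numpy as np

def position_dependent_kernels(F, K):
    """Build w_L(x, y, i, j) = sum_f F_f(x, y) * K_f(i, j) (sketch).

    F : (Nf, H, W)     position dependent weights (similarity features)
    K : (Nf, kh, kw)   position independent kernels (predetermined or learned)
    Returns an (H, W, kh, kw) array: one kernel per spatial position.
    """
    # Contract over the feature index f, keeping spatial and tap indices.
    return np.einsum("fhw,fij->hwij", F, K)
```

Note that the per-position cost is Nf multiply-adds per kernel tap, independent of how many distinct kernels the layer effectively realizes.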
Figure 9 shows a schematic diagram highlighting the main processing stage 901 of the data processing apparatus 100 according to an embodiment, for instance, the data processing apparatus 100 providing the neural network 110 illustrated in figures 7 and 8. As already described above, in a first processing step 903 the neural network 110 can generate the plurality of position dependent weights or similarity features Ff(x, y) 119a on the basis of the array of guiding data g(x, y) 113. In a second processing step 905 the neural network 110 can generate the plurality of position dependent kernels wL(x, y, i, j) 118 on the basis of the plurality of position dependent weights or similarity features Ff(x, y) 119a and the plurality of position independent kernels Kf 119b. In a further step (not shown in figure 9, but similar to the processing step 605 shown in figure 6) the neural network layer 120 can generate the array of output data values out(x, y) 121 on the basis of the array of input data values in(x, y) 117 and the plurality of position dependent kernels wL(x, y, i, j) 118.
As already mentioned above, in an embodiment the neural network layer 120 of the neural network 110 can be implemented in the form of a correlation network layer 120 configured to generate the array of output data values 121 from the array of input data values 117 and a further array of input data values by correlating the array of input data values 117 with the further array of input data values and by applying a respective position dependent kernel of the plurality of position dependent kernels 118 to a respective sub-array of the array of input data values 117 and a corresponding sub-array of the further array of input data values. In case the array of input data values 117, the further array of input data values and the plurality of position dependent kernels 118 are respective two-dimensional arrays (as in the embodiments shown in figures 2 and 7), the correlation network layer 120 can be configured to generate the array of output data values out(x, y) 121 on the basis of the following equation:

$$out(x,y) = \frac{1}{W_{L1}(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x,y,i,j)\, in1(x-i,y-j)\, in2(x-i,y-j)$$

wherein in1(x − i, y − j) denotes the array of input data values 117, in2(x − i, y − j) denotes the further array of input data values, wL1(x, y, i, j) denotes the plurality of position dependent kernels 118 and r denotes the size of each kernel of the plurality of position dependent kernels 118 (in this example, each kernel has (2r+1)×(2r+1) kernel values). The output data values 121 can be normalized using the following normalization factor:

$$W_{L1}(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x,y,i,j)$$

In other embodiments, the normalization factor can be omitted, i.e. set to one. As will be appreciated, the above equations for a two-dimensional input array and a kernel having a quadratic shape can be easily adapted to the case of an array of input values 117 having one dimension or more than two dimensions and/or a kernel having a non-square rectangular shape, i.e. different horizontal and vertical dimensions.
In a further embodiment, the correlation network layer 120 is configured to generate the array of output data values 121 from the array of input data values 117 and the further array of input data values by correlating the array of input data values 117 with the further array of input data values and by applying a respective position dependent kernel of the plurality of position dependent kernels 118 associated to the array of input data values 117 and a further position dependent kernel of a plurality of further position dependent kernels associated to the further array of input data values to a respective sub-array of the array of input data values 117 and a corresponding sub-array of the further array of input data values. In case the array of input data values 117 and the further array of input data values are respective two-dimensional arrays (as in the embodiments shown in figures 2 and 7), the correlation network layer 120 can be configured to generate the array of output data values out(x, y) 121 on the basis of the following equation:

$$out(x,y) = \frac{1}{W_{L12}(x,y)} \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x,y,i,j)\, in1(x-i,y-j)\, w_{L2}(x,y,i,j)\, in2(x-i,y-j)$$

wherein in1(x − i, y − j) denotes the array of input data values 117, in2(x − i, y − j) denotes the further array of input data values, wL1(x, y, i, j) denotes the plurality of position dependent kernels 118 and wL2(x, y, i, j) denotes the plurality of further position dependent kernels associated to the further array of input data values. The output data values 121 can be normalized using the following normalization factor:

$$W_{L12}(x,y) = \sum_{i=-r}^{r} \sum_{j=-r}^{r} w_{L1}(x,y,i,j)\, w_{L2}(x,y,i,j)$$
In a further embodiment, the neural network layer 120 is configured to process the array of input data values 117 on the basis of the plurality of position dependent kernels 118 using a maximum or minimum pooling scheme. More specifically, in such an embodiment, the neural network layer 120 is configured to generate a respective output data value of the array of output data values 121 by determining a respective input data value of a respective sub-array of the plurality of sub-arrays of the array of input data values 117 being associated with a maximum or minimum kernel value of a respective position dependent kernel of the plurality of position dependent kernels 118 and using the respective determined input data value as the respective output data value.
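The maximum-pooling variant can be sketched as follows. This is a minimal NumPy illustration; the zero padding and the indexing convention (the kernel entry at (i, j) addresses the input position (x - i, y - j), matching the equations above) are assumptions, and the function name is illustrative. The minimum variant follows by replacing argmax with argmin.

```python
import numpy as np

def guided_max_pool(inp, w, r):
    """Guided 'max' pooling: the output at (x, y) is the input value located
    where the position dependent kernel w[x, y] takes its maximum value."""
    H, W = inp.shape
    p = np.pad(inp, r)  # zero padding, an assumed border treatment
    out = np.empty((H, W))
    for x in range(H):
        for y in range(W):
            k = w[x, y]                                    # (2r+1, 2r+1) kernel
            i, j = np.unravel_index(np.argmax(k), k.shape)
            # Kernel index (i, j) addresses input position (x - (i - r), y - (j - r)),
            # i.e. padded index (x + 2r - i, y + 2r - j).
            out[x, y] = p[x + 2 * r - i, y + 2 * r - j]
    return out
```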
In a further embodiment, the neural network layer 120 according to one of the embodiments described above is used by the neural network 110 to perform weighted aggregation of stereo matching costs in order to obtain a depth map from a stereo image. Cost aggregation is a commonly used approach to minimize noise and improve the depth estimation results. Without additional weighting, object borders at depth discontinuities would normally be over-smoothed. Consequently, a much desired feature is to preserve these borders by taking into account some additional knowledge about object borders in the scene. Thus, advantageously, the neural network layer 120 can use e.g. object features derived from semantic segmentation as the guiding data 113 in order to determine the object borders in the scene and guide the aggregation process of the input stereo matching costs, producing the aggregated stereo matching costs as output.
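As an illustration of this use case, the following NumPy sketch aggregates a stereo matching cost volume with one position dependent guidance kernel per spatial position, applied identically to every disparity slice. The (D, H, W) volume layout, the zero padding and the function name are assumptions of the sketch, not specifics of the claimed apparatus.

```python
import numpy as np

def aggregate_cost_volume(cost, w, r):
    """Weighted aggregation of a (D, H, W) stereo matching cost volume with one
    (2r+1, 2r+1) guidance kernel per spatial position, shared over disparities.
    Kernels derived from e.g. semantic segmentation put low weight across
    object borders, so depth discontinuities are not smoothed away."""
    D, H, W = cost.shape
    pad = np.pad(cost, ((0, 0), (r, r), (r, r)))  # zero-pad spatial dims only
    out = np.zeros_like(cost, dtype=float)
    for x in range(H):
        for y in range(W):
            k = w[x, y]
            norm = float(k.sum()) or 1.0
            # Window covering positions (x - i, y - j) for all disparity slices.
            win = pad[:, x:x + 2 * r + 1, y:y + 2 * r + 1][:, ::-1, ::-1]
            out[:, x, y] = (win * k).sum(axis=(1, 2)) / norm
    return out
```

A winner-takes-all depth estimate then follows from np.argmin over the disparity axis of the aggregated volume.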
Figure 10 shows a flow diagram illustrating a data processing method 1000 based on a neural network 110 according to an embodiment. The data processing method 1000 can be performed by the data processing apparatus 100 shown in figure 1 and its different embodiments. The data processing method 1000 comprises the step 1001 of generating, by the neural network layer 120 of the neural network 110, from the array of input data values 117 the array of output data values 121 based on the plurality of position dependent kernels 118 and the plurality of sub-arrays of the array of input data values 117. Embodiments of the data processing methods may be implemented and/or performed by one or more processors as described above.
Referring back to the various embodiments described above, a first kernel is considered different to a second kernel if a kernel value of the array of kernel values of the first kernel at at least one position (or of at least one element) of the first kernel is different from the kernel value of the array of kernel values of the second kernel at the same position (or of the same element) of the kernel. Typically, a kernel has the same size (number of elements, positions or values per dimension) and dimension (number of dimensions N, N >= 1) as the sub-array of the array of input values it is applied to. Typically, the different sub-arrays of the array of input values all have the same size and dimension. Accordingly, the different kernels typically have the same size and dimension.
A first sub-array of the array of input values is considered different to a second sub-array of the array of input values if the first sub-array comprises at least one element of the array of input values which is not comprised by the second sub-array. Typically, the different sub-arrays of the array of input values differ by at least one column or row of elements of the array of input values. The different sub-arrays may partially overlap or not overlap, as shown in Fig. 3.
In the following, some further details about various aspects and embodiments (aggregation network layer, convolution network layer, correlation network layer and normalization) are provided.

Aggregation
Embodiments of the proposed guided aggregation can be applied for guided down-scaling of feature maps. By using input position dependent kernels as the guiding data, input values, which are features of the feature map, are grouped into sub-arrays of the input data array and can be further aggregated in a controlled way, producing an output feature value representative of the whole sub-array. This way, when changing the resolution of the input feature map, object borders and other details that are normally lost during down-scaling can be better preserved. In such cases, the guiding data represents information about object or region borders, obtained by e.g. color-based segmentation, semantic segmentation using preceding neural network layers or an edge map of a texture image corresponding to the processed feature map.

Convolution
Embodiments of the proposed guided convolution can be applied for switchable feature extraction. Input values, which are features of the feature map, are convolved with adaptable feature extraction filters which are formed from the input guiding data in the form of position dependent kernels. This way, each selected area of the input feature map can be processed with feature extraction filters producing only the features desired for these regions. Here, the guiding data in the form of similarity features represents information about object or region borders, obtained by e.g. color-based segmentation, semantic segmentation using preceding neural network layers, an edge map of a texture image corresponding to the processed feature map or a ROI (region of interest) binary map.
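Combining this with the kernel construction of claims 3 and 4 (a position independent filter bank blended per position by position dependent weights), switchable feature extraction can be sketched as follows. Single-channel input, zero padding and all names are assumptions of this illustration.

```python
import numpy as np

def switchable_conv(feat, bank, weights, r):
    """Convolution with per-position kernels blended from a filter bank.

    bank    : (Nf, 2r+1, 2r+1) position independent kernels.
    weights : (H, W, Nf) position dependent weights, e.g. soft region masks
              from a preceding segmentation layer; the effective kernel at
              (x, y) is sum_f weights[x, y, f] * bank[f].
    """
    H, W = feat.shape
    p = np.pad(feat, r)  # zero padding, an assumed border treatment
    out = np.zeros((H, W))
    for x in range(H):
        for y in range(W):
            k = np.tensordot(weights[x, y], bank, axes=1)  # blend the filter bank
            # Reversed window so element (i+r, j+r) is the input at (x - i, y - j).
            win = p[x:x + 2 * r + 1, y:y + 2 * r + 1][::-1, ::-1]
            out[x, y] = float((win * k).sum())
    return out
```

In a region where only one weight is non-zero, the layer thus behaves as an ordinary convolution with the filter selected for that region.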
Correlation
Embodiments of the proposed guided correlation can be applied for guided correlation of input feature maps. By using input position dependent kernels as the guiding data, input values, which are features of the two or more feature maps, are correlated together in a controlled way, enabling amplification or attenuation of certain features within a correlation region. This way, features that correspond to other objects or regions in the feature map can be excluded or taken into account with smaller impact when computing the result. Also, some of the features characteristic of a selected region can be amplified. Here, the guiding data represents information about object or region borders, obtained by e.g. color-based segmentation, semantic segmentation using preceding neural network layers, an edge map of a texture image corresponding to the processed feature map or a ROI (region of interest) binary map.
Normalization
In general, normalization is advantageous if the output values obtained for different spatial positions are to be compared to each other per value, without any intermediate step. In that case, preservation of the mean (DC) component is desirable. If such a comparison is not performed, normalization is not required and only increases complexity. Accordingly, one can omit normalization in order to simplify the computations and compute only an approximate result.
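The DC-preservation argument can be made concrete with a tiny example, assuming the aggregation form out(x, y) = (1/W_L) · Σ in · w from above: for a constant input patch, the normalized output reproduces the constant exactly, while the unnormalized output scales it by the total kernel mass. The kernel values here are purely illustrative.

```python
import numpy as np

# One position dependent kernel and a constant input patch of value 5.0.
k = np.array([[1.0, 2.0, 1.0],
              [2.0, 4.0, 2.0],
              [1.0, 2.0, 1.0]])   # kernel mass k.sum() == 16.0
patch = 5.0 * np.ones((3, 3))

unnormalized = float((patch * k).sum())            # DC component scaled by 16
normalized = float((patch * k).sum() / k.sum())    # DC component preserved
```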
While a particular feature or aspect of the disclosure may have been disclosed with respect to only one of several implementations or embodiments, such feature or aspect may be combined with one or more other features or aspects of the other implementations or embodiments as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "include", "have", "with", or other variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprise". Also, the terms "exemplary", "for example" and "e.g." are merely meant as an example, rather than the best or optimal. The terms "coupled" and "connected", along with their derivatives, may have been used. It should be understood that these terms may have been used to indicate that two elements cooperate or interact with each other regardless of whether they are in direct physical or electrical contact or are not in direct contact with each other.
Although specific aspects have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific aspects shown and described without departing from the scope of the present disclosure. This application is intended to cover any adaptations or variations of the specific aspects discussed herein.
Although the elements in the following claims are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.
Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art readily recognize that there are numerous applications of the invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art recognize that many changes may be made thereto without departing from the scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.

Claims

1. A data processing apparatus (100) comprising: a processor (101) configured to provide a neural network (110), wherein the neural network (110) comprises a neural network layer (120) being configured to generate from an array of input data values (117) an array of output data values (121) based on a plurality of position dependent kernels (118; 119) and a plurality of sub-arrays of the array of input data values (117).
2. The data processing apparatus (100) of claim 1, wherein the neural network (110) comprises an additional neural network layer (115) configured to generate the plurality of position dependent kernels (118) based on an original array of original input data values (111, 117) of the neural network (110), wherein the original array of original input data values (111, 117) of the neural network (110) comprises the array of input data values (117) or another array of input data values (111) associated to the array of input data values (117).
3. The data processing apparatus (100) of claim 1 or 2, wherein the neural network (110) is configured to generate the plurality of position dependent kernels (118) based on a plurality of position independent kernels (119b) and a plurality of position dependent weights (119a).
4. The data processing apparatus (100) of claim 3, wherein the neural network (110) is configured to generate a kernel of the plurality of position dependent kernels (118) by adding the position independent kernels (119b) weighted by the associated position dependent weights (119a).
5. The data processing apparatus (100) of claim 3 or 4, wherein the plurality of position independent kernels (119b) are predetermined or learned and wherein the neural network (110) comprises an additional neural network layer (115) or processing layer (115) configured to generate the plurality of position dependent weights (119a) based on an original array of original input data values (111, 117) of the neural network (110), wherein the original array of original input data values (111, 117) of the neural network (110) comprises the array of input data values (117) or another array of input data values (111) associated to the array of input data values (117).
6. The data processing apparatus (100) of any one of claims 3 to 5, wherein the array of input data values (117) and the array of output data values (121) are two-dimensional arrays and the neural network layer (120) is configured to generate a kernel of the plurality of position dependent kernels (118) on the basis of the following equation:

w_L(x, y, i, j) = Σ_{f=1}^{N_f} F_f(x, y) · K_f(i, j),

wherein F_f(x, y) denotes the plurality of N_f position dependent weights (119a) and K_f denotes the plurality of position independent kernels (119b).
7. The data processing apparatus (100) of any one of the preceding claims, wherein the neural network layer (120) is a convolutional network layer or an aggregation network layer.
8. The data processing apparatus (100) of any one of the preceding claims, wherein the array of input data values (117) and the array of output data values (121) are two-dimensional arrays, wherein the neural network layer (120) is a convolutional network layer configured to generate the array of output data values (121) on the basis of the following equation:

out(x, y, c_o) = (1 / W_L(x, y, c_o)) · Σ_{c_i} Σ_{i=-r}^{r} Σ_{j=-r}^{r} in(x - i, y - j, c_i) · w_L(x, y, c_o, c_i, i, j)

with:

W_L(x, y, c_o) = Σ_{c_i} Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, c_o, c_i, i, j),

wherein out(x, y, c_o) denotes the array of output data values (121), in(x, y, c_i) denotes the array of input data values (117), r denotes a size of each kernel of the plurality of position dependent kernels w_L(x, y, c_o, c_i, i, j) and W_L(x, y, c_o) denotes a normalization factor, or, wherein the neural network layer (120) is an aggregation network layer configured to generate the array of output data values (121) on the basis of the following equation:

out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} in(x - i, y - j) · w_L(x, y, i, j)

with:

W_L(x, y) = Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L(x, y, i, j),

wherein out(x, y) denotes the array of output data values (121), in(x, y) denotes the array of input data values (117), r denotes a size of each kernel of the plurality of position dependent kernels w_L(x, y, i, j) and W_L(x, y) denotes a normalization factor.
9. The data processing apparatus (100) of any one of claims 1 to 7, wherein the neural network layer (120) is a correlation network layer configured to generate the array of output data values (121) from the array of input data values (117) and a further array of input data values by: correlating the array of input data values (117) with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels (118), or correlating the array of input data values (117) with the further array of input data values and applying a position dependent kernel of the plurality of position dependent kernels (118) associated to the array of input data values (117) and a further position dependent kernel of a plurality of further position dependent kernels associated to the further array of input data values.
10. The data processing apparatus (100) of claim 9, wherein the array of input data values (117), the further array of input data values and the plurality of position dependent kernels (118) are respective two-dimensional arrays and wherein the correlation neural network layer (120) is configured to generate the array of output data values (121) on the basis of the following equation:

out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} in1(x - i, y - j) · in2(x - i, y - j) · w_L1(x, y, i, j)

with:

W_L(x, y) = Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L1(x, y, i, j),

wherein out(x, y) denotes the array of output data values (121), in1(x, y) denotes the array of input data values (117), in2(x, y) denotes the further array of input data values, r denotes a size of each kernel of the plurality of position dependent kernels w_L1(x, y, i, j) and W_L(x, y) denotes a normalization factor, or

out(x, y) = (1 / W_L(x, y)) · Σ_{i=-r}^{r} Σ_{j=-r}^{r} in1(x - i, y - j) · w_L1(x, y, i, j) · in2(x - i, y - j) · w_L2(x, y, i, j)

with:

W_L(x, y) = Σ_{i=-r}^{r} Σ_{j=-r}^{r} w_L1(x, y, i, j) · w_L2(x, y, i, j),

wherein out(x, y) denotes the array of output data values (121), in1(x, y) denotes the array of input data values (117), in2(x, y) denotes the further array of input data values, r denotes a size of each kernel of the plurality of position dependent kernels w_L1(x, y, i, j) and of each kernel of the plurality of further position dependent kernels w_L2(x, y, i, j) and W_L(x, y) denotes a normalization factor.
11. The data processing apparatus (100) of any one of claims 1 to 7, wherein the neural network layer (120) is configured to generate a respective output data value of the array of output data values (121) by determining a respective input data value of a respective sub-array of input data values of the plurality of sub-arrays of input data values being associated with a maximum or minimum kernel value of a position dependent kernel and using the respective determined input data value as the respective output data value.
12. A data processing method (1000) comprising:
generating (1001) by a neural network layer (120) of a neural network (110) from an array of input data values (117) an array of output data values (121) based on a plurality of position dependent kernels (118) and a plurality of sub-arrays of the array of input data values (117).
13. A computer program comprising program code for performing the method (1000) of claim 12, when executed on a computer and/or a processor.
PCT/EP2017/057088 2017-03-24 2017-03-24 Neural network data processing apparatus and method WO2018171899A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/EP2017/057088 WO2018171899A1 (en) 2017-03-24 2017-03-24 Neural network data processing apparatus and method
EP17713634.8A EP3590076A1 (en) 2017-03-24 2017-03-24 Neural network data processing apparatus and method
CN201780088904.6A CN110462637B (en) 2017-03-24 2017-03-24 Neural network data processing device and method


Publications (1)

Publication Number Publication Date
WO2018171899A1 true WO2018171899A1 (en) 2018-09-27



Non-Patent Citations (8)

BERT DE BRABANDERE ET AL: "Dynamic Filter Networks", arXiv, 6 June 2016, XP055432972
DI KANG ET AL: "Crowd Counting by Adapting Convolutional Neural Networks with Side Information", arXiv, 21 November 2016, XP080732861
J. LONG; E. SHELHAMER; T. DARRELL: "Fully Convolutional Networks for Semantic Segmentation", CVPR, 2015
JAMPANI VARUN ET AL: "Learning Sparse High Dimensional Filters: Image Filtering, Dense CRFs and Bilateral Neural Networks", CVPR 2016, pages 4452-4461, XP033021632, DOI: 10.1109/CVPR.2016.482
KAIMING HE; JIAN SUN; XIAOOU TANG: "Guided Image Filtering", ECCV, 2010
LI YIJUN ET AL: "Deep Joint Image Filtering", 17 September 2016, Lecture Notes in Computer Science, Springer International Publishing, pages 154-169, ISBN: 978-3-540-28012-5, XP047355387
M. ELAD: "On the origin of bilateral filter and ways to improve it", IEEE Transactions on Image Processing, vol. 11, no. 10, October 2002, pages 1141-1151
SIMON NIKLAUS ET AL: "Video Frame Interpolation via Adaptive Convolution", arXiv, 22 March 2017, XP080758826



